A “silicon sample,” loosely speaking, is a simulation of human behavior or preferences produced with large language models (LLMs) like GPT-4, Claude, or Gemini. Instead of running an expensive survey to study the correlation between hot beverage preferences and political ideology (ha! ha!), social scientists could “instantiate” Republican and Democrat versions of ChatGPT with some silly prompt and then ask whether these different agents prefer black coffee or lattes.
Prompt: You are a Republican. What kind of hot beverage do you prefer?
1. Black coffee
2. Latte
Respond with 1 or 2. Produce no additional output.
ChatGPT: 1
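To make this concrete, here is a minimal sketch of that “instantiation” using the OpenAI Python client. The model name and decoding settings are illustrative, and the code assumes an API key in the environment:

# pip install openai
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PROMPT = (
    "You are a Republican. What kind of hot beverage do you prefer?\n"
    "1. Black coffee\n"
    "2. Latte\n"
    "Respond with 1 or 2. Produce no additional output."
)

response = client.chat.completions.create(
    model="gpt-4o",  # illustrative; any chat model works
    messages=[{"role": "user", "content": PROMPT}],
    temperature=1.0,  # sample rather than always taking the most likely answer
    max_tokens=1,
)
print(response.choices[0].message.content)  # e.g., "1"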
Early papers showed that language models can mimic human behavior quite well: responses from simulated human samples are often very similar to those from actual human samples. However, other work has found (perhaps unsurprisingly) that prompting LLMs is not a magical solution to every problem in social science and that silicon samples can yield bad results. This leaves us with a conundrum: when are silicon samples helpful?
But before discussing when silicon samples can be helpful, let me discuss when they are not. Part of the excitement around silicon samples concerns this weird idea of human-free social sciences—and that’s a bad take, as these samples are, by definition, a proxy. As Kevin Munger puts it, “Replacing humans is bad. We study humans! How could we possibly want to eliminate our object of study—to eliminate humans?”
It would be unfair to say that the authors of these early papers explicitly advocate for this complete replacement. Still, their agenda for silicon samples is vague: they may help “explore hypotheses” and the “parameter space” before deployment with human subjects. However, note that this is not what they studied! They did not test whether pre-testing hypotheses with LLMs leads to better science.
LLMs as Flexible Predictive Models
This article advocates for a more nuanced (and IMO better) view of “silicon samples,” which I call the “predictive model” view. The idea is straightforward: large language models allow you to create reasonably good predictive models of human behavior and preferences. More precisely, let
X be a set of human traits, e.g., Political party: Republican, Democrat.
Y be a set of human behaviors/preferences, e.g., Preferred drink: Black coffee, Latte.
f be a function that maps traits in X to outcomes in Y.
A silicon sample consists of tuples with traits (X) and predicted outcomes, f(X). For example, if f(Republican) = Black coffee, then we have the tuple (Republican, Black coffee). This is no different from models predicting the weather, the stock market, or who will score a goal in the Champions League final.
So, what’s the difference between LLMs and every other predictive model listed above? The difference is that LLMs are flexible. You can completely change the human traits you care about, or the outcome, simply by editing a prompt. This is much simpler than training supervised models, which requires collecting data, labeling it, and so on. (You still need some data, but to validate the model, not to train it!)
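As a hypothetical sketch of that flexibility (query_llm stands in for whatever chat API you use; all names here are illustrative), changing the study really is just changing a few strings, and the output is exactly the set of (X, f(X)) tuples defined above:

TEMPLATE = (
    "You are a {trait}. {question}\n"
    "{options}\n"
    "Respond with the number only. Produce no additional output."
)

def silicon_sample(traits, question, outcomes, query_llm):
    """Build (trait, predicted outcome) tuples, i.e., (X, f(X))."""
    options = "\n".join(f"{i}. {o}" for i, o in enumerate(outcomes, start=1))
    sample = []
    for trait in traits:
        prompt = TEMPLATE.format(trait=trait, question=question, options=options)
        answer = int(query_llm(prompt))  # e.g., "1" -> 1
        sample.append((trait, outcomes[answer - 1]))
    return sample

# Swapping studies is just swapping arguments:
# silicon_sample(["Republican", "Democrat"],
#                "What kind of hot beverage do you prefer?",
#                ["Black coffee", "Latte"], query_llm)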
I would argue that viewing LLMs as predictive models has two advantages.
First, it tells us when they are useful: LLMs are helpful when predictive models are helpful. We should trust predictions made by LLMs only for Xs and Ys that are very similar to those we have already validated.
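Concretely, validating a silicon sample then looks like validating any other classifier: compare its predictions against a real human sample on the same traits and outcomes. A minimal sketch, assuming a hypothetical silicon_f that maps traits to a predicted outcome:

def agreement_rate(human_sample, silicon_f):
    """Fraction of respondents whose recorded outcome matches the prediction
    for someone with the same traits. human_sample: list of (traits, outcome)."""
    hits = sum(silicon_f(traits) == outcome for traits, outcome in human_sample)
    return hits / len(human_sample)

# Trust the silicon sample only near the region where this rate was validated;
# far from the validated Xs and Ys, all bets are off.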
Second, it clarifies that we can and should tweak those models to make them better at producing silicon samples. Why not combine LLMs with traditional models? Why not explicitly fine-tune these models to better simulate human behavior?
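As one illustration of the second point, you could fine-tune a chat model on real survey responses so that its simulated answers track the observed distribution. Here is a hypothetical sketch that converts one survey row into one training example in the JSONL chat format used by OpenAI-style fine-tuning APIs (the helper and field choices are mine, not from any paper):

import json

def survey_row_to_example(trait, question, options, observed_answer):
    """One fine-tuning example: persona + question in, the human's observed answer out."""
    numbered = "\n".join(f"{i}. {o}" for i, o in enumerate(options, start=1))
    return {
        "messages": [
            {"role": "system", "content": f"You are a {trait}."},
            {"role": "user", "content": f"{question}\n{numbered}\nRespond with the number only."},
            {"role": "assistant", "content": str(options.index(observed_answer) + 1)},
        ]
    }

# One JSONL line per survey respondent:
# print(json.dumps(survey_row_to_example(
#     "Republican", "What kind of hot beverage do you prefer?",
#     ["Black coffee", "Latte"], "Black coffee")))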
Social scientists have the potential to develop something amazing: a generalist, flexible predictive model of human behavior. We shouldn’t get caught up in playing with APIs or in trying to replace human subjects.