In one of computer science's “founding myths,” Turing (1950) proposes a test that would work as a proxy for “machine intelligence,” an interrogation game (the “imitation game”) in which a machine should fool an interrogator into believing it is a human. He goes as far as to say:
I believe that in 50 years’ time, it will be possible to make computers play the imitation game so well that an average interrogator will have no more than 70% chance of making the right identification after 5 minutes of questioning.
Already in the 1960s, the ELIZA program famously managed to fool some people by mimicking a Rogerian psychotherapist, repeating back words the interrogator used (Weizenbaum, 1966). Fast forward to 2023, and the LLM-powered online game “Human or Not?” came close to Turing’s prediction: in a two-minute conversation, humans correctly identified the AI only 68% of the time (Jannai et al., 2023).
Interestingly, LLMs' capacity to simulate1 human-like behavior goes far beyond the Turing test. In a much-discussed paper, Argyle et al. (2022) found evidence that LLMs are surprisingly accurate at simulating opinions, behaviors, and preferences. They used a strategy that came to be known as persona prompting: asking the LLM to roleplay as someone with specific traits and beliefs, and then asking it a question. For example, one may use a prompt like
You are a 53-year-old Hispanic-American woman who identifies as a Democrat. When asked if I support legislation to increase gun control, I say…
to simulate surveys with “silicon samples.” For example, given a population of size n for which you know each person's age, ethnicity, gender, and political orientation, simply substitute those attributes into the prompt above, collect the model's completions, and you get an estimate of support for gun control legislation.
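To make the “silicon sample” recipe concrete, here is a minimal Python sketch of the templating-and-aggregation loop. The `query_llm` helper is a placeholder for whatever LLM API you use, and the prompt wording simply mirrors the example above; this is an illustration of the idea, not Argyle et al.'s actual code.

```python
from dataclasses import dataclass


@dataclass
class Respondent:
    age: int
    ethnicity: str
    gender: str
    party: str


# Placeholder: in practice this would call an LLM provider's API and
# return the generated completion. Stubbed here so the sketch runs.
def query_llm(prompt: str) -> str:
    return "I support it."


TEMPLATE = (
    "You are a {age}-year-old {ethnicity} {gender} who identifies as a "
    "{party}. When asked if I support legislation to increase gun "
    "control, I say"
)


def estimate_support(population: list[Respondent]) -> float:
    """Fraction of simulated respondents whose completion signals support."""
    votes = []
    for person in population:
        prompt = TEMPLATE.format(
            age=person.age,
            ethnicity=person.ethnicity,
            gender=person.gender,
            party=person.party,
        )
        answer = query_llm(prompt).lower()
        # Crude keyword check, purely for illustration.
        votes.append("support" in answer and "not support" not in answer)
    return sum(votes) / len(votes)


population = [
    Respondent(53, "Hispanic-American", "woman", "Democrat"),
    Respondent(67, "white", "man", "Republican"),
]
print(f"Estimated support: {estimate_support(population):.0%}")
```

In practice you would sample each persona multiple times and parse the answers more carefully than with a keyword check, but the core loop of fill template, query, aggregate is the whole trick.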
Subsequent work went beyond simulating surveys and tried to replace human subjects in (rather famous) social science experiments and games. In these “in silico” experiments, ethical restrictions may not apply, which has led Aher et al. (2023) to repeat Milgram’s (1963) controversial shock experiment,2 finding similar results.
Both Argyle et al. (2022) and Aher et al. (2023) capture the excitement among (some) social scientists about how AI may supercharge their research. This excitement is likely accentuated by challenges faced in the social sciences. For example, survey-based data collection is in dire straits: from 1997 to 2012, Pew Research's household-level response rate dropped from 37% to 12% (Kohut et al., 2012). Further, concerns about the generalizability and utility of modern social science have led to numerous calls for reform [e.g., see Watts (2017) and Almaatouq (2022)]. Could AI be the panacea for all these problems?
A lot of people are hesitant to call it a day, though how strongly they object, and why, varies a great deal. To some, the idea is insulting at a fundamental level: social scientists study humans, so why should human subjects, in all their intricacies, be replaced? I perceive this refusal to be akin to how many artists loathe AI-generated art; the overall feeling is best captured by this video of Miyazaki’s response to an AI demo being shown to him. He says:
I would never wish to incorporate this technology into my work at all. I strongly feel that this is an insult to life itself.
However, concerns around AI-powered research go far beyond a visceral feeling that “something is off.” The last “scientific revolution” triggered by AI, when researchers adopted traditional machine learning algorithms, led to a lot of bad science [e.g., many of the results are incorrect due to data leakage (Kapoor and Narayanan, 2023)], and so could the new wave of generative AI. Without the hurdles of recruiting and meaningfully engaging with human subjects, “silicon samples” could lead to a flurry of lazy studies. A significant problem with modern science is that increasing the production of scientific artifacts (e.g., papers, code, grants) does not necessarily increase our understanding of the world, and the likes of ChatGPT may make this much worse.
But beyond the explosion of lazy studies, AI could hinder the production of scientific knowledge in other, more subtle ways. Messeri and Crockett (2024) argue that the widespread use of AI might harm science by creating the illusion that we understand more about the world than we do. In their words:
AI solutions can also exploit our cognitive limitations, making us vulnerable to illusions of understanding in which we believe we understand more about the world than we actually do. Such illusions obscure the scientific community’s ability to see the formation of scientific monocultures, in which some types of methods, questions and viewpoints come to dominate alternative approaches, making science less innovative and more vulnerable to errors.
Last, and perhaps most important, it turns out that AI is far from perfect at simulating human behavior. In the wake of Argyle et al.'s (2022) headline results, many papers found that LLMs are actually quite bad at capturing the diversity of human behavior.
A particularly meaningful failure mode is what Cheng et al. (2023) refer to as caricatures: “an exaggerated narrative of the persona (the demographic that we aim to simulate) rather than a meaningful response to the topic.” Humans are nuanced, complex, and sometimes plainly inconsistent, and LLM simulations do not capture those “wrinkles.” Wang et al. (2024) show firsthand that simple persona prompting fails to capture group heterogeneity. Their assessment is quite dire and well reflected in their title: Large language models should not replace human participants because they can misportray and flatten identity groups.
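One rough way to see the “flattening” that Wang et al. (2024) describe is to compare the spread of answers within a group for human versus simulated respondents. The sketch below does this with a simple standard-deviation check on hypothetical Likert-scale data; the numbers are made up for illustration, and this is not the specific metric used in the papers above.

```python
import statistics

# Hypothetical Likert-scale (1-5) answers from one demographic group.
# Real human responses tend to be spread out; caricatured simulations
# often collapse onto the group's stereotypical answer.
human_responses = [1, 2, 3, 4, 5, 2, 4, 3, 5, 1]
silicon_responses = [4, 4, 5, 4, 4, 5, 4, 4, 4, 5]


def spread(responses: list[int]) -> float:
    """Population standard deviation as a crude heterogeneity measure."""
    return statistics.pstdev(responses)


print(f"human spread:   {spread(human_responses):.2f}")
print(f"silicon spread: {spread(silicon_responses):.2f}")
# A much smaller silicon spread is a red flag that the simulation is
# portraying a caricature rather than the group's real heterogeneity.
```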
But to what extent can we make claims about LLMs in general when we are prompting them in the simplest way possible? Hu and Collier (2024) find that increasing the number of persona variables included in the prompt actually makes predictions much more accurate. Further, the more these variables correlate with the outcome we want to predict, the better:
We find a linear relationship in our setting: the more persona variables are correlated with the outcome variable, the better LLMs' predictions are using persona prompting. Large, preference-tuned models perform best and can explain up to 81% of variance found in human responses. However, when the utility of persona variables is low, persona prompting has little effect.
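To make “explain up to 81% of variance” concrete: explained variance here is the coefficient of determination (R²) between the persona-prompted predictions and the answers humans actually gave. A toy computation with made-up numbers (not Hu and Collier's data):

```python
import numpy as np

# Made-up example: human survey answers (1-5) and the LLM's
# persona-prompted predictions for the same respondents.
human = np.array([1, 2, 2, 3, 4, 5, 5, 3, 4, 2])
predicted = np.array([1, 2, 3, 3, 4, 4, 5, 3, 4, 3])

ss_res = np.sum((human - predicted) ** 2)     # residual sum of squares
ss_tot = np.sum((human - human.mean()) ** 2)  # total sum of squares
r_squared = 1 - ss_res / ss_tot               # fraction of variance explained

print(f"explained variance (R^2): {r_squared:.2f}")
```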
So, for example, maybe saying that someone is “a 53-year-old Hispanic-American woman who identifies as a Democrat” is not enough! Perhaps we need variables and data that capture the nuances and complexities of human subjects. Fortunately, LLMs can extract these nuances from unstructured text. In groundbreaking work, Park et al. (2024) show that LLMs can consistently simulate people when prompted not with demographic variables but with long-form, unstructured interviews. They write:
To create simulations that better reflect the myriad, often idiosyncratic, factors that influence individuals' attitudes, beliefs, and behaviors, we turn to in-depth interviews—a method that previous work on predicting human life outcomes has employed to capture insights beyond what can be obtained through traditional surveys and demographic instruments.
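As a rough sketch of what interview-conditioned prompting might look like: instead of a demographic one-liner, the whole transcript goes into the prompt and the model is asked to answer as that participant. The function names, prompt framing, and stubbed `query_llm` call below are illustrative placeholders, not Park et al.'s actual pipeline.

```python
# Hypothetical sketch: condition the simulation on a long-form interview
# transcript instead of a handful of demographic variables.
def build_interview_prompt(transcript: str, question: str) -> str:
    return (
        "Below is an in-depth interview with a study participant.\n"
        "Answer the next survey question the way this person would.\n\n"
        f"--- INTERVIEW ---\n{transcript}\n--- END INTERVIEW ---\n\n"
        f"Question: {question}\nAnswer:"
    )


# Placeholder for an LLM call; in practice this hits your provider's API.
def query_llm(prompt: str) -> str:
    return "Somewhat agree."


transcript = (
    "Interviewer: Tell me about your neighborhood.\n"
    "Participant: I've lived here thirty years; it has changed a lot..."
)
question = "Do you support legislation to increase gun control?"
print(query_llm(build_interview_prompt(transcript, question)))
```

The design choice worth noting is that the interview carries idiosyncratic signal (personal history, attitudes, contradictions) that no small set of demographic fields can encode, which is exactly what the quoted passage argues.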
I believe this approach captures what’s most exciting about simulating human behavior with AI. These simulations open new opportunities to augment rather than replace existing practices in the social sciences. Even though AI is not a panacea for studying human behavior, this early work indicates that it will be a helpful building block for creating solutions to existing challenges!
1. I am deliberately avoiding the “what is intelligence?” debate here. Simulating something that looks human is interesting by itself (and useful enough).
2. Aher et al.'s (2023) description of the experiment is quite good: “An authority figure in the subject’s eyes, orders the subject to shock a victim with increasingly high voltage shocks. After receiving 20 shocks, the victim (an actor in another room) starts banging on the wall and refuses to participate but the experimenter urges the subject to keep shocking the victim. Milgram found that many subjects completed administering 30 shocks, showing a surprisingly strong level of compliance for following the malevolent instructions of an authority figure who had no special powers to enforce his commands.”