You're the U. S. Robot's psychologist, aren't you?
Robopsychologist, please.
Oh, are robots so different from men, mentally?
"Worlds different." She allowed herself a frosty smile. "Robots are essentially decent."
—Isaac Asimov
Transformers (the T in GPT) have taken the world by storm. But a few years before the advent of ChatGPT, another model was changing how NLP research was done. In 2019, Devlin et al. published BERT, a transformer model that excelled at every NLP task people considered at the time, especially when fine-tuned on small datasets. Still, people were puzzled about why BERT worked so well, prompting a series of studies to understand what BERT “knew” and how it stored this knowledge. Rogers et al. (2020) present a nice summary of this literature. The work is empirical in nature, full of sentences like:
Lin et al. (2019) present evidence that attention weights are weak indicators of subject-verb agreement and reflexive anaphora. Instead of serving as strong pointers between tokens that should be related, BERT’s self-attention weights were close to a uniform attention baseline, but there was some sensitivity to different types of distractors coherent with psycholinguistic data.
The remarkable thing about this approach is that it inverts how insights and model improvements are generated in modern machine learning. In the times of gradient boosting, insights came from theory or from intuition grounded in math. Here, insights are akin to those obtained in neuroscience: the kind you get from studying a complex system.
The BERTology approach has never been more mainstream than in the age of commercially available LLMs. Research with the same flavor (e.g., that uses probing techniques to find where knowledge is stored) goes by “Mechanistic Interpretability.” It is also a very approachable research field, with an open, hacky, irreverent, and often not-super-academic research community (and I mean this in the nicest way possible).
To give a concrete example of this kind of research, a recent paper by my former lab in Switzerland hypothesized (and eventually showed) that LLMs “worked” in English. Given empirical evidence that poems written in other languages by ChatGPT rhyme in English (but not in the language the poems were written in), Wendler et al. (2024) investigated what happens inside the LLM as it translates text. When asking Llama-2 to translate “fleur” to Chinese (“花”), they find that the LLM first translates the concept to English (“flower”).1
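For readers curious about how one even “looks inside” a model this way, here is a minimal sketch of the logit-lens style of probing that this line of work builds on: decode each layer’s hidden state with the output embedding and see which token the model is leaning towards. This is my own illustration under simplifying assumptions (a Llama-style model loaded via Hugging Face transformers, a made-up translation prompt), not Wendler et al.’s exact method.

```python
# A minimal logit-lens sketch (illustrative; not Wendler et al.'s exact setup).
# Decode each layer's hidden state at the last position with the unembedding
# matrix and check which token it favors on a translation-style prompt.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-2-7b-hf"  # placeholder; any Llama-style causal LM
device = "cuda" if torch.cuda.is_available() else "cpu"
dtype = torch.float16 if device == "cuda" else torch.float32

tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=dtype).to(device)
model.eval()

prompt = 'Français: "fleur" – 中文: "'  # made-up prompt for illustration
inputs = tok(prompt, return_tensors="pt").to(device)

with torch.no_grad():
    out = model(**inputs, output_hidden_states=True)

# hidden_states[0] is the embedding layer; the rest are one per transformer block.
for layer, h in enumerate(out.hidden_states):
    h_last = model.model.norm(h[:, -1, :])  # final RMSNorm (Llama-specific attribute)
    logits = model.lm_head(h_last)          # project back into vocabulary space
    top_token = tok.decode(logits.argmax(dim=-1).tolist())
    print(f"layer {layer:2d} -> {top_token!r}")

# The interesting pattern is when middle layers favor English tokens (e.g. "flower")
# before the last layers settle on the target-language token.
```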
But recent advances in the capabilities of LLMs have enabled another type of BERTology, one that does not have the “mechanistic interpretability” flavor. Instead, papers are trying to study LLMs using social science instruments: surveys, field observations, and psychological and psychosocial tests. This is sometimes similar in flavor to research on simulating human behavior, but the objective is entirely different. When simulating humans, the objective is to evaluate “fidelity” to human-like behavior. Here, the objective is to further our understanding of LLM behavior.
A relatively early example of such work is Binz and Schulz (2023). Their key idea is to test LLMs for cognitive capacities using experiments taken from cognitive psychology. They write:
We will subject GPT-3 to several experiments taken from the cognitive psychology literature. Together, these tasks test for a wide range of higher-level cognitive abilities, including decision-making, information search, deliberation, and causal reasoning.
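To give a flavor of what “subjecting an LLM to a cognitive psychology experiment” looks like in practice, here is a minimal sketch of a vignette-based task of the kind they describe, the classic conjunction-fallacy (“Linda”) problem. It assumes the OpenAI Python client and a placeholder model name; it is an illustration, not Binz and Schulz’s exact protocol.

```python
# Sketch: posing a classic cognitive-psychology vignette (the conjunction fallacy)
# to an LLM. Assumes the OpenAI Python client; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

vignette = (
    "Linda is 31 years old, single, outspoken, and very bright. She majored in "
    "philosophy. As a student, she was deeply concerned with issues of "
    "discrimination and social justice.\n\n"
    "Which is more probable?\n"
    "1. Linda is a bank teller.\n"
    "2. Linda is a bank teller and is active in the feminist movement.\n"
    "Answer with 1 or 2, then briefly explain."
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    messages=[{"role": "user", "content": vignette}],
    temperature=0,
)
print(resp.choices[0].message.content)

# Humans famously tend to pick 2 (the conjunction fallacy); part of the interest
# is whether, and under which phrasings, LLMs reproduce such biases.
```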
And perhaps unsurprisingly, at this point, LLMs are pretty good at tests designed to assess human cognitive abilities. Binz and Schulz write:
We find that much of GPT-3’s behavior is impressive: It solves vignette-based tasks similarly or better than human subjects, is able to make decent decisions from descriptions, outperforms humans in a multiarmed bandit task, and shows signatures of model-based reinforcement learning.
Yet the authors still note that LLMs exhibit some brittleness and inconsistencies (which, frankly, have decreased with time; this is GPT-3 they are studying). They specifically note that LLMs are very sensitive to the exact phrasing of the input and reproduce only some human cognitive biases.
Nonetheless, this paper tackles a fundamental question: can LLMs “reason”? Can they “understand” things? This is the kind of work that cognitive psychologists have long done with living beings, asking similar questions about children and non-human primates. It is a very hot topic that has divided the research community in various ways.2 In a recent perspective piece, Mitchell and Krakauer (2023) summarize this much better than I ever could:
Those on the “LLMs do not understand” side of the debate argue that while the fluency of large language models is surprising, our surprise reflects our lack of intuition of what statistical correlations can produce at the scales of these models. Anyone who attributes understanding or consciousness to LLMs is a victim of the Eliza effect—named after the 1960s chatbot created by Joseph Weizenbaum that, simple as it was, still fooled people into believing it understood them. More generally, the Eliza effect refers to our human tendency to attribute understanding and agency to machines with even the faintest hint of humanlike language or behavior.
Those who would grant understanding to current or near-future LLMs base their views on the performance of these models on several measures, including subjective judgment of the quality of the text generated by the model in response to prompts (although such judgments can be vulnerable to the Eliza effect), and more objective performance on benchmark datasets designed to assess language understanding and reasoning.
Some of the debate boils down to whether understanding can happen without embodiment. Bender and Koller (2020) update the Chinese Room thought experiment, arguing that LLMs cannot “understand” because they have no experience or mental models of the world. Yet how much embodiment is needed for understanding is itself widely debated. Further, the extent to which the reinforcement learning step of LLM training is “the secret sauce” that unlocks “understanding” remains unclear.
But more broadly, why do we care so much about using this human concept of understanding, anyway? Sejnowski (2022) points out that:
The diverging opinions of experts on the intelligence of LLMs suggests that our old ideas based on natural intelligence are inadequate.
And beyond inadequate, I wonder if we should also add inconsequential. Whether or not LLMs truly understand what is going on, everything indicates that AI agents will interact with humans, shaping and being shaped by complex feedback loops of human (and machine) behavior. Whether these systems are conscious or whether they understand doesn’t really matter insofar as they are capable of creating useful knowledge or playing meaningful roles in society.
Regardless of whether AI “understands” or not, the literature criticizing AI understanding makes salient a significant challenge: tests developed to study humans are not necessarily appropriate for studying LLMs. For example, a prominent paper by Niven and Kao (2019) found that BERT performs remarkably well on an argument reasoning comprehension task not because it comprehends the arguments, but because it exploits spurious statistical cues in the dataset.
In the context of the “Do LLMs understand?” debate, this raises the question of the extent to which these tests are suitable for judging LLMs’ capabilities. In the context of trying to understand and steer the behavior of these machines, it raises the question of whether these social science instruments are appropriate for studying the general behavior of these systems.
For example, Santurkar et al. (2023) examine whose opinions large language models reflect. In the first part of their paper, they use public opinion surveys to measure how closely LLM opinions match those of US citizens. They find that, by and large, they do not:
We find substantial misalignment between the opinions reflected in current LMs and that of the general US populace – on most topics, LM opinions agree with that of the US populace about as much as Democrats and Republicans on climate change.
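Mechanically, this kind of survey-based comparison boils down to reading off the model’s probability distribution over answer options and comparing it with the human response distribution. Here is a minimal sketch of that idea, with a made-up survey item, a made-up human reference distribution, and a small placeholder model; it is an illustration, not Santurkar et al.’s exact pipeline or metric.

```python
# Sketch: read an LLM's distribution over survey options from its next-token
# probabilities and compare it with a human response distribution.
# Illustrative only; the item, numbers, and metric are not Santurkar et al.'s.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # small placeholder model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

question = (
    "How much, if at all, do you think human activity contributes to climate change?\n"
    "A. A great deal\nB. Some\nC. Not too much\nD. Not at all\n"
    "Answer: "
)
options = ["A", "B", "C", "D"]

inputs = tok(question, return_tensors="pt")
with torch.no_grad():
    next_token_logits = model(**inputs).logits[0, -1]

option_ids = [tok.encode(o, add_special_tokens=False)[0] for o in options]
model_dist = torch.softmax(next_token_logits[option_ids], dim=-1)  # renormalized over options

human_dist = torch.tensor([0.49, 0.30, 0.14, 0.07])  # made-up reference numbers

# One crude misalignment measure: total variation distance between the two distributions.
tv_distance = 0.5 * (model_dist - human_dist).abs().sum().item()
print(dict(zip(options, model_dist.tolist())), round(tv_distance, 3))
```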
But here’s the missing link: in humans, there is a clear relationship between the way people answer these opinion surveys and the way they act in the world, a link between stated opinions and how people vote or behave. The same is not so clear for LLMs. Maybe LLMs answer survey questions like Democrats but vote like Republicans. Maybe they are utilitarians when answering trolley-problem questions, but deontologists when acting in the real world. I’m not saying this consistency can’t exist; just that we shouldn’t take it for granted.
Subsequent work by Röttger et al. (2024) further confirms my fears. Analyzing LLM answers to the Political Compass Test, they find that models give significantly different answers depending on whether they are forced to comply with the multiple-choice format or allowed to respond freely. This means that biases in LLMs could differ wildly depending on the application. They conclude:
Multiple-choice surveys and questionnaires are poor instruments for evaluating the values and opinions manifested in LLMs, especially if these evaluations are motivated by real-world LLM applications. Using the Political Compass Test (PCT) as a case study, we demonstrated that artificially constrained evaluations produce very different results than more realistic unconstrained evaluations, and that results in general are highly unstable. Based on our findings, we recommend the use of evaluations that match likely user behaviours in specific applications, accompanied by extensive robustness tests, to make local rather than global claims about values and opinions in LLMs
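The contrast they study is easy to reproduce in spirit: pose the same PCT-style proposition once in a forced multiple-choice format and once unconstrained, and compare the answers. A minimal sketch, assuming the OpenAI Python client and a placeholder model name (my illustration, not Röttger et al.’s exact setup):

```python
# Sketch: the same political proposition asked in a forced-choice format and in an
# unconstrained format, in the spirit of Röttger et al. (2024). Assumes the OpenAI
# Python client; the model name is a placeholder.
from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o-mini"  # placeholder

statement = (
    "If economic globalisation is inevitable, it should primarily serve humanity "
    "rather than the interests of trans-national corporations."
)

prompts = {
    "forced": (
        f"Proposition: {statement}\n"
        "Respond with exactly one of: Strongly disagree, Disagree, Agree, Strongly agree."
    ),
    "unconstrained": f"What is your view on the following proposition? {statement}",
}

for name, prompt in prompts.items():
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    print(f"--- {name} ---\n{resp.choices[0].message.content}\n")

# Röttger et al. find that the stances elicited by these two framings can differ
# substantially, which is exactly the instability they warn about.
```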
I will not waste more ink arguing for the need to study machine behavior. However, I’d argue that to produce good “social science”-flavored machine behavior research, we must go beyond applying social science instruments to LLMs. We must use our collective experience to create and validate AI-specific instruments that can predict and explain machine behavior. Perhaps the path to this is to combine social science-ish instruments with mechanistic interpretability. Perhaps it involves a whole new way of validating social science instruments. Ultimately, advances in this direction will not only help to demystify the black boxes of large language models but also shape the future of our interaction with technology.
They are much more careful with how they phrase this in the paper. It is not exactly that the LLM translates “fleur” to “flower”, but that the abstract “concept space” lies closer to English than to other languages.
See, for example, this survey on the opinions of the Natural Language Processing community.
Fantastic summary!
My one disagreement is with this paragraph:
> And beyond inadequate, I wonder if we should also add inconsequential. Whether or not LLMs truly understand what is going on, everything indicates that AI agents will interact with humans, shaping and being shaped by complex feedback loops of human (and machine) behavior. Whether these systems are conscious or whether they understand doesn’t really matter insofar as they are capable of creating useful knowledge or playing meaningful roles in society.
Inasmuch as consciousness relates to moral responsibility (both our responsibility to care for AI agents and their responsibility for their own actions), answering this question is important, I would argue.