<?xml version="1.0" encoding="UTF-8"?><rss xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:atom="http://www.w3.org/2005/Atom" version="2.0" xmlns:itunes="http://www.itunes.com/dtds/podcast-1.0.dtd" xmlns:googleplay="http://www.google.com/schemas/play-podcasts/1.0"><channel><title><![CDATA[Doomscrolling Babel]]></title><description><![CDATA[Writing about online platforms and the impact of AI on society.]]></description><link>https://doomscrollingbabel.manoel.xyz</link><image><url>https://substackcdn.com/image/fetch/$s_!ol0M!,w_256,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1c88597c-d832-45b5-a4ed-5717e2dec675_256x256.png</url><title>Doomscrolling Babel</title><link>https://doomscrollingbabel.manoel.xyz</link></image><generator>Substack</generator><lastBuildDate>Tue, 05 May 2026 11:53:11 GMT</lastBuildDate><atom:link href="https://doomscrollingbabel.manoel.xyz/feed" rel="self" type="application/rss+xml"/><copyright><![CDATA[Manoel Horta Ribeiro]]></copyright><language><![CDATA[en]]></language><webMaster><![CDATA[doomscrollingbabel@substack.com]]></webMaster><itunes:owner><itunes:email><![CDATA[doomscrollingbabel@substack.com]]></itunes:email><itunes:name><![CDATA[Manoel Horta Ribeiro]]></itunes:name></itunes:owner><itunes:author><![CDATA[Manoel Horta Ribeiro]]></itunes:author><googleplay:owner><![CDATA[doomscrollingbabel@substack.com]]></googleplay:owner><googleplay:email><![CDATA[doomscrollingbabel@substack.com]]></googleplay:email><googleplay:author><![CDATA[Manoel Horta Ribeiro]]></googleplay:author><itunes:block><![CDATA[Yes]]></itunes:block><item><title><![CDATA[Qualitative Research After the Interviewer Bot]]></title><description><![CDATA[A new hot take]]></description><link>https://doomscrollingbabel.manoel.xyz/p/qualitative-research-after-the-interviewer</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/qualitative-research-after-the-interviewer</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Sat, 18 Apr 2026 13:16:53 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/e89dad8f-a6a3-4c19-a60f-b7dfd0d2d9ff_1085x1080.png" length="0" type="image/png"/><content:encoded><![CDATA[<p style="text-align: justify;">I have a new hot take. The first half of it is perhaps not so <em>hot</em>, and it would probably be consensus on Bluesky: <strong>AI cannot replace qualitative research.</strong> As the <a href="https://journals.sagepub.com/doi/full/10.1177/10778004251401851">letter</a> from prominent qualitative researchers argues, meaning-making is a human activity, and even if LLMs could do it, it is unclear how useful it would be. More broadly, I&#8217;d argue that qualitative research is what qualitative researchers do &#8212; and at least for now, they are not buying the AI pill. The second part is where things may get controversial. <strong>People will use Generative AI to answer questions previously reserved for qualitative research, and they will do so within a quantitative paradigm preferred by institutions.</strong> And before you throw the first stone, this is a descriptive claim, not a normative one.</p><p style="text-align: justify;">But let me take a step back to define what I mean by &#8220;qualitative research,&#8221; because my use of the term is inevitably shaped by my work in HCI and other CS-y fields.
The tradition I have in mind includes things like ethnography, participant observation, interviewing, interpretive content analysis, and grounded theory. These are sometimes referred to as &#8220;interpretive&#8221; in contrast to &#8220;positivist&#8221; methods. The acceptance of interpretive methods in tech-adjacent fields was (<a href="https://interactions.acm.org/archive/view/january-february-2024/evaluating-interpretive-research-in-hci?utm_source=chatgpt.com">and still is</a>) uneven and bumpy, but it has happened because of their great utility for making sense of phenomena that do not yield easily to logs or math alone. And it feels like these phenomena are ever-relevant, because important questions seem to be at the edge of what our institutions can quantify: stigma, <a href="https://dl.acm.org/doi/pdf/10.1145/3491102.3502038">domestic violence,</a> emotional labor.</p><p style="text-align: justify;">But the mainstream institutional acceptance of qualitative research often rests on the limits of quantitative approaches. Institutions have not truly bought the &#8220;thick-description&#8221; pill. They seek it when quantification is not yet available, credible, or cheap, but are quick to pivot once that is no longer true. We see this pattern across the most diverse areas, from <a href="https://www.iilj.org/publications/measuring-the-world-indicators-human-rights-and-global-governance/">human rights monitoring</a> to <a href="https://academic.oup.com/bjsw/article/45/1/395/1739630">child protection</a>: stakeholders&#8217; preferences for numbers displaced much interpretive work in favor of a couple of metrics and decision trees. And it is worth understanding their viewpoint: they have to navigate complicated trade-offs, and that is arguably easier to do once you have simplified your world into a handful of indicators.</p><p style="text-align: justify;">My point here is not to turn this into a discussion of the <a href="https://en.wikipedia.org/wiki/Economics_imperialism">epistemic validity of different methods</a>, but to give context on what advances in GenAI mean for qualitative research (maybe beyond my modest talents, but hey, this is a blog post). Even if not referred to as &#8220;Qualitative AI research,&#8221; new technologies enabled by GenAI, such as open-ended interviewing at scale or semi-automated inductive coding, enable quantitative approaches to examine questions that were previously too much horse for quant research&#8217;s cowboy.</p>
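<p style="text-align: justify;">To make &#8220;semi-automated inductive coding&#8221; concrete, here is roughly what the loop looks like. This is a minimal sketch, not any particular product: the <code>llm()</code> helper is a hypothetical stand-in for whatever text-generation API you use, and its canned reply exists only to keep the sketch runnable.</p><pre><code>import json

def llm(prompt):
    # Hypothetical stand-in for a real text-generation API call.
    # Returns a canned reply so the sketch runs end to end.
    return '["platform harassment", "withdrawal from posting"]'

def propose_codes(snippet):
    # Pass 1: ask the model to suggest inductive codes for one excerpt.
    prompt = (
        "Suggest one to three short qualitative codes for this "
        "interview excerpt, as a JSON list of strings.\n\n" + snippet
    )
    return json.loads(llm(prompt))

def consolidate(codes):
    # Pass 2: merge near-duplicate codes into a draft codebook.
    prompt = (
        "Merge these qualitative codes into a deduplicated codebook, "
        "as a JSON list of strings:\n" + json.dumps(sorted(set(codes)))
    )
    return json.loads(llm(prompt))

snippets = [
    "I stopped posting after the harassment started.",
    "Logging off felt like the only way to stay safe.",
]
codes = [c for s in snippets for c in propose_codes(s)]
codebook = consolidate(codes)  # a human still has to review and revise this
</code></pre><p style="text-align: justify;">The point is not that this is good qualitative research; it is that something this mechanical already produces codebook-shaped output, which is precisely what makes it legible to institutions.</p>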
<p style="text-align: justify;">Consider, for example, the recent hype around the <a href="https://www.anthropic.com/research/anthropic-interviewer">&#8220;Anthropic Interviewer.&#8221;</a> The system is no ethnographer in any deep sense, but it can collect reasonably good interviews cheaply and at scale. And that&#8217;s probably preferable for many institutions: they get something that fills the same gap as qualitative inquiry, but in a form more amenable to scale and quantification.</p><p style="text-align: justify;">One potential criticism of my point is that no questions are inherently &#8220;qualitative.&#8221; Fair enough. But in practice, institutions and firms usually behave as if the distinction exists. They distinguish between questions they see as problems of measurement and optimization, and questions they see as requiring interpretation, context, and meaning-making. A health agency, for instance, would not approach estimating vaccine effectiveness in the same way it approaches understanding why some communities resist vaccination. But what happens when that second task begins to look operationalizable at scale, say, through a Claude-backed interviewer?</p><p style="text-align: justify;">What is most interesting is how hard it is to criticize this approach while staying within an interpretivist frame. There are genuine concerns about AI-conducted interviews. But the critiques most likely to persuade institutional stakeholders often appeal to concepts like reliability, validity, and generalizability, terms that interpretivist traditions have often resisted. The threat model here, then, is that AI may reorganize how institutions understand the value of qualitative inquiry.</p><p style="text-align: justify;">As a quantitative researcher myself (who got chastised for being a &#8220;positivist&#8221; only once! I was a master&#8217;s student, give me a break), I find these new developments exciting but at the same time slightly terrifying. These technological advances may help address important questions at scale and with the kinds of guarantees that institutions care about: consistency, comparability, auditability, and speed. But they force harder questions: who gets to decide when an AI-informed account of a phenomenon is &#8220;good enough&#8221;? And whose interests are served when that threshold is set? As <a href="https://www.nature.com/articles/s41586-024-07146-0">Messeri and Crockett</a> remind us, &#8220;increasing productivity does not guarantee an improved understanding of the world.&#8221;</p>
]]></content:encoded></item><item><title><![CDATA[Brandon Sanderson’s Case Against AI Art]]></title><description><![CDATA[Is anti-AI the new &#8220;back in my day&#8221;?]]></description><link>https://doomscrollingbabel.manoel.xyz/p/brandon-sandersons-case-against-ai</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/brandon-sandersons-case-against-ai</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Sun, 15 Feb 2026 20:22:47 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/6874bc1d-a88f-4298-9a7c-cf0be17ada49_1600x1245.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I&#8217;ve spent a significant chunk of my life reading&#8212;and, more recently, listening to&#8212;fantasy and sci-fi. That&#8217;s probably why I clicked so quickly when YouTube recommended a video of Brandon Sanderson, an author I admire (and who, if I&#8217;m honest, is too productive for me to keep up with), talking through his take on AI art. The video is very good, and I highly recommend you watch Brandon read his essay, even though it contains no chasmfiends or mistwraiths.</p><div id="youtube2-mb3uK-_QkOo" class="youtube-wrap" data-attrs="{&quot;videoId&quot;:&quot;mb3uK-_QkOo&quot;,&quot;startTime&quot;:null,&quot;endTime&quot;:null}" data-component-name="Youtube2ToDOM"><div class="youtube-inner"><iframe src="https://www.youtube-nocookie.com/embed/mb3uK-_QkOo?rel=0&amp;autoplay=0&amp;showinfo=0&amp;enablejsapi=0" frameborder="0" loading="lazy" gesture="media" allow="autoplay; fullscreen" allowautoplay="true" allowfullscreen="true" width="728" height="409"></iframe></div></div><p>Brandon&#8217;s essay shows intellectual humility, which I find lacking in the AI Art discourse (a genre that tends to oscillate between victory laps and moral panic). He starts by recalling how Roger Ebert infamously said that &#8220;video games are not art,&#8221; a stance he and his nerdy audience take as <em>prima facie</em> false. He goes on to note that Ebert is no pioneer in calling bullshit on whatever new form of art emerges: earlier iterations included poets dunking on prose writers and portrait artists dunking on photographers. Thus, he wonders: Is he just dunking on the new thing?</p><p>He ultimately concludes that this is <em>not</em> the case.
And the way he gets there is by clearing away the objections that <em>aren&#8217;t</em> doing the real work for him. Yes, he&#8217;s worried about the economics. Yes, he&#8217;s bothered by the ethics of training on artists&#8217; work. Yes, the environmental costs matter. But he insists that even in a world where models are trained only on consented data and run on efficient hardware, something would still feel off.</p><p>The core of his argument is that AI art collapses art into <em>product</em>, when art is also (and maybe primarily?) <em>process</em>: the activity through which a person becomes capable of making the thing. He illustrates this with his own early, not-great juvenilia: books that, for him, worked as &#8220;receipts&#8221; of him becoming a writer. Art, thus, is the process happening to the person making it.</p><p>He ends by concluding that this is why his take on AI art is different from Ebert&#8217;s take on video games: AI art is fundamentally different from the previous revolutions. And, on a more positive note, since art is what society collectively decides to treat as art, there&#8217;s a simple way of &#8220;winning&#8221; the war against AI art: it suffices that enough people decide this form of art is not worth pursuing.</p><p>I&#8217;m sympathetic to the spirit of this&#8212;and still not convinced by the conclusion. I see three problems.</p><p>First, although art is socially constructed, that doesn&#8217;t mean it&#8217;s easily steered. I really wish Brazil still produced music like in the 60s or 70s, and I&#8217;m sure a lot of people do. But our wills are not enough, and that stuff only rarely fuels a banger TikTok. Cultural tastes move due to people&#8217;s will, but also due to what is cheap, frictionless, and easy. So even if we grant Sanderson the premise that art is what we collectively treat as art, it doesn&#8217;t follow that &#8220;collectively deciding&#8221; is a crisp process we can just execute.</p><p>Second, I&#8217;m not sure I fully buy the historical analogy. Every revolution looks uniquely alien. When photography emerged, one could claim that photography <em>removed the process</em>, just as AI does now. And if photography didn&#8217;t <em>kill</em> painting, it definitely reshaped it. Painting survived by shifting its aims: toward impression, abstraction, expression, and the kinds of seeing that a camera didn&#8217;t naturally deliver.</p><p>Third, Sanderson implicitly argues that there is no process in making art with AI. But isn&#8217;t writing with AI akin to painting with a camera? If you define &#8220;process&#8221; as the physical act of applying pigment, then yes, photography looks like cheating. But if you define process as the chain of choices that expresses taste and intention, photography is saturated with it. So why, in principle, can&#8217;t AI art be the same? I understand that the slop being sold on Kindle feels subpar, but so did the first photographs when compared to the paintings of their time.</p><p>None of this refutes Sanderson&#8217;s insight: art is more than the artifact&#8212;or as he&#8217;d put it, &#8220;journey before destination.&#8221; But it does complicate his account of why AI art &#8220;feels wrong.&#8221; Maybe the bitter lesson here is that, when the time comes, the new and the strange will always find ways to disappoint us.
Still, he need not worry about me: 2026 is the year I finally finish <em>Wind and Truth</em>.</p>]]></content:encoded></item><item><title><![CDATA[The Empiricism Gap in Computer Science]]></title><description><![CDATA[So far, I have argued that there is a dissonance between, on the one hand, CS&#8217;s founding myths, curricula, and self-image, and, on the other hand, the modern production of knowledge in computer science.]]></description><link>https://doomscrollingbabel.manoel.xyz/p/the-empiricism-gap-in-computer-science</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/the-empiricism-gap-in-computer-science</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Fri, 26 Dec 2025 11:52:48 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b95767a8-ca00-477f-9ca3-27b65281ee03_2304x1792.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>So far, <a href="https://doomscrollingbabel.manoel.xyz/p/the-empiricization-of-computer-science">I have argued</a> that there is a dissonance between, on the one hand, CS&#8217;s founding myths, curricula, and self-image, and, on the other hand, the modern production of knowledge in computer science.</p><p>Here, I want to name the dominant style of empiricism that has emerged from that reality, particularly in machine learning. For lack of a better term, I&#8217;ll call it <em>build-and-test empiricism</em>. As the name suggests, this flavor of empiricism treats <em>construction</em> as the primary research method: you build a model, a system, or an algorithm, run it, measure the outcomes, and iterate. It works best when the field shares a common yardstick (benchmarks, leaderboards, standardized testbeds) because then disagreement collapses into a simple question: <em>which approach performs better under the same evaluation?</em></p>
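<p>To see that engine in miniature: fix a test set, and the argument between two methods reduces to a sort. A deliberately toy sketch (the two &#8220;models&#8221; here are stand-ins I made up for illustration):</p><pre><code># Toy shared-yardstick evaluation: both "models" face the same test
# set, and disagreement collapses into a single sortable number.
test_set = [(2, 4), (3, 9), (4, 16), (5, 25)]  # (input, expected output)

def model_square(x):
    return x * x

def model_double(x):
    return 2 * x

def accuracy(model):
    hits = sum(1 for x, y in test_set if model(x) == y)
    return hits / len(test_set)

leaderboard = sorted(
    {"square-net": model_square, "double-net": model_double}.items(),
    key=lambda entry: accuracy(entry[1]),
    reverse=True,
)
for name, model in leaderboard:
    print(name, accuracy(model))  # square-net 1.0, double-net 0.25
</code></pre>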
<p>In this <em>modus operandi</em>, knowledge accumulates through artifacts and evidence: a working construction, together with empirical performance results that others can reproduce, compare, and build on. A canonical example is the ImageNet <a href="https://www.image-net.org/challenges/LSVRC/2012/index.php">Large Scale Visual Recognition Challenge (ILSVRC)</a>, which turned image classification into a shared yardstick: everyone trained on the same large dataset and compared on the same evaluation metric. When Krizhevsky, Sutskever, and Hinton entered <a href="https://proceedings.neurips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf">AlexNet</a> in ILSVRC 2012, they didn&#8217;t just propose &#8220;a good idea&#8221;; they could demonstrate the value of their contribution directly on the shared yardstick.</p><p>I&#8217;ll contrast this with another flavor of empiricism, prevalent in political science, economics, and epidemiology. Again, for lack of a better term, I&#8217;ll call it <em>describe-and-defend</em> empiricism. This flavor of empiricism treats inference as the first-class citizen: you formulate a claim about the world, assemble evidence, and argue why the data support that claim. It works best when the field shares a common language for what counts as evidence&#8212;identification, counterfactual reasoning, and uncertainty&#8212;because then disagreement collapses into a more straightforward question: is this claim credible given the design and the assumptions?</p><p>In this <em>modus operandi</em>, knowledge accumulates through the refinement and defense of generalizable statements about reality. A canonical example is the minimum-wage debate. Instead of &#8220;does this model beat that model on ImageNet?&#8221;, the shared question became a claim about the world: do minimum-wage increases reduce employment? <a href="https://davidcard.berkeley.edu/papers/njmin-aer.pdf">Card and Krueger&#8217;s (1994)</a> famous New Jersey&#8211;Pennsylvania study framed the problem as a quasi-experiment: New Jersey raised its minimum wage in 1992, Pennsylvania did not, so one can compare employment changes in fast-food restaurants across the border before and after the policy change (a classic difference-in-differences design). Their headline result challenged the textbook prediction, sparking years of dispute not only about the answer but also about whether the identification strategy, measurement, and data sources justified the claim.</p><p>That is describe-and-defend empiricism in action: the conversation quickly moved to critiques, reanalyses, and replies (e.g., <a href="https://www.nber.org/system/files/working_papers/w12663/w12663.pdf">Neumark &amp; Wascher&#8217;s comment</a> and <a href="https://www.aeaweb.org/articles?id=10.1257/aer.90.5.1397">Card &amp; Krueger&#8217;s reply</a> in the <em>AER</em>), because the community is built around adversarial scrutiny of credibility.</p>
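<p>For readers who have never seen it, the arithmetic behind that design is almost embarrassingly simple; the hard part is defending the assumptions. A minimal sketch with made-up numbers (illustrative only, not Card and Krueger&#8217;s estimates):</p><pre><code># Difference-in-differences on a 2x2 table: (group) x (before/after).
# All numbers are hypothetical full-time-equivalent jobs per store.
nj_before, nj_after = 20.0, 21.0  # treated: NJ raised its minimum wage
pa_before, pa_after = 23.0, 21.5  # control: PA did not

change_treated = nj_after - nj_before  # +1.0
change_control = pa_after - pa_before  # -1.5

# The estimate is the difference between those changes. It is causal
# only under the parallel-trends assumption: absent the policy, NJ
# employment would have drifted the way PA's did.
did_estimate = change_treated - change_control
print(did_estimate)  # +2.5: no job loss attributed to the raise
</code></pre><p>Note that the entire dispute lives in the assumption, not in the subtraction.</p>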
<p>The debate then generalized to broader designs, such as using discontinuities at state borders across multiple policy changes, to test whether the claim holds up across settings.</p><h1>Maintenance Layer</h1><p>A helpful way to contrast these different flavors of empiricism more sharply is to separate the <em>engine</em> of each style from its <em>maintenance layer</em>. In other words, once a community has a means of generating results, how does it ensure honesty? What are the routines and institutions that notice when the engine starts producing nonsense: when results don&#8217;t replicate, when benchmarks get gamed, when claims are overstated, or when an entire methodological approach is proven incorrect? The maintenance layer is the answer: the field&#8217;s error-correction machinery, the set of practices that turns &#8220;a pile of results&#8221; into something closer to cumulative knowledge.</p><p>As we&#8217;ve mentioned, build-and-test empiricism has an apparent engine: artifacts plus shared yardsticks. But shared yardsticks come with predictable pathologies. Because benchmarks are finite and public, it is easy to overfit not only models to data but scientists to metrics: countless small choices (hyperparameters, prompt formats, data filtering, selection of seeds) get tuned until they look good on this evaluation. Reproducibility also becomes fragile: in modern ML, results can turn on details that are hard to fully specify&#8212;hardware, distributed training quirks, nondeterminism, dataset versions, and preprocessing pipelines. And even when a paper&#8217;s headline number is &#8220;real,&#8221; it can be a narrow kind of reality: a win that evaporates under distribution shift, under a slightly different label definition, or in a deployment setting that wasn&#8217;t represented in the benchmark.</p><p>The community&#8217;s response is often to patch the process with <em>rituals</em>&#8212;lightweight forms of quality control meant to raise the floor without slowing the iteration loop too much. Some rituals target reporting (checklists; standard tables and ablations), some target auditability (artifact evaluation and badging; code and data release norms), and some target interpretability and scope (model cards, datasheets, documentation templates). None of these rituals &#8220;solves&#8221; the epistemology of generalization. But they are a pragmatic maintenance strategy: a way to keep the engine running while limiting the most common failure modes, and to make it harder for the field to confuse a brittle benchmark win with a robust piece of knowledge.</p><p>Describe-and-defend empiricism has a different engine: claims anchored in an argument about why the data identify what you say they identify. Its maintenance layer looks different, too, because the risks are not &#8220;overfitting the benchmark&#8221; so much as &#8220;fooling ourselves with confounding, researcher discretion, or fragile specifications.&#8221; Thus, the field expends considerable effort on robustness: alternative specifications, placebos, sensitivity analyses, replication across contexts, and explicit uncertainty quantification. The rituals exist here too (pre-analysis plans, preregistration, replication packages), but they are downstream of a deeper norm: you do not get to make a strong claim unless you can defend it.</p>
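<p>Before watching the two maintenance layers respond to an actual error, it is worth seeing how cheaply the build-and-test engine can fool itself. A toy simulation of the seed-shopping pathology described above, assuming nothing beyond noisy benchmark scores:</p><pre><code>import random

random.seed(0)

# Two methods with IDENTICAL true quality; benchmark scores are noisy.
def noisy_score(true_quality=0.70, noise=0.02):
    return random.gauss(true_quality, noise)

baseline = noisy_score()  # honest method: run once, report the score

# "Improved" method: try 20 seeds/configs, report the best-looking run.
best_of_20 = max(noisy_score() for _ in range(20))

print(round(baseline, 3), round(best_of_20, 3))
# The max of 20 noisy draws sits almost two standard deviations above
# the mean, so the apparent "improvement" is pure selection.
</code></pre>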
<p>There is no better way to see the difference between the two <em>maintenance layers</em> than to observe how each ecosystem responds when it discovers a methodological error. In econometrics, the recent &#8220;difference-in-differences reckoning&#8221; is a good example: researchers realized that the vanilla two-way fixed effects DiD regression behaves badly under staggered adoption and heterogeneous treatment effects <a href="https://www.sciencedirect.com/science/article/abs/pii/S0304407621001445">(Goodman-Bacon, 2021)</a>. In plain English, people were doing bad stats for a couple of years. The response was not to &#8220;use a checklist next time.&#8221; Instead, the field produced <a href="https://www.sciencedirect.com/science/article/pii/S0304407620303948">new estimators</a>, <a href="https://www.sciencedirect.com/science/article/pii/S0304407623001318">review papers with concrete recommendations</a>, and, crucially, <em>a wave of re-analyses</em>: applied papers were re-estimated with updated methods to see which conclusions survived (e.g., <a href="https://www.aeaweb.org/articles?id=10.1257/app.20230089">Nagengast and Yotov, 2025</a>).</p><p>Machine learning handled &#8220;mistakes&#8221; differently. For example, when a benchmark is found to be leaky, saturated, or overly gameable, the community typically shifts the goalposts rather than revisiting the back catalog. Consider ImageNet: when <a href="https://arxiv.org/abs/1902.10811">Recht et al. (2019)</a> built <em>fresh</em> test sets, they found nontrivial accuracy drops, evidence that the community had partially overfit to a heavily reused benchmark. The typical response was not to issue corrigenda for years of ImageNet papers. Instead, the field continued to iterate on the next yardstick. For example, by the time Recht et al. published their paper, computer vision&#8217;s center of gravity had already shifted from image classification to tasks such as detection, segmentation, and captioning, and to datasets such as <a href="https://arxiv.org/abs/2403.18819">MS COCO</a>. Old papers remain &#8220;fine,&#8221; not because their conclusions are timeless, but because they were produced under the rituals that were locally acceptable at the time. The maintenance layer is oriented toward &#8220;patch-and-proceed&#8221; rather than &#8220;re-estimate-and-reckon&#8221;.</p><h1>Why build-and-test empiricism is not enough</h1><p>This piece is a critique of build-and-test empiricism. But it is worth steelmanning it before criticizing it. In many ways, the progress in ML and CS more generally is the envy of other disciplines. Build-and-test empiricism creates fast feedback loops, rewards concrete contributions, and makes knowledge unusually transmissible. Under that regime, research &#8220;compounds&#8221;: the community shares datasets and code, converges on testbeds, and iterates quickly enough to discover regimes that theory did not predict. Moritz Hardt put this better than I ever could in his <a href="https://mlbenchmarks.org/">new book on ML benchmarks</a>: <em>&#8220;researchers might&#8217;ve indeed dismissed the idea of benchmarking as unreasonable, if it hadn&#8217;t been so successful.&#8221;</em></p><p>So what&#8217;s the problem? Especially outside of machine learning, CS increasingly studies domains that lack a benchmark to guide iterative development, and many of the most critical questions are explicitly causal: Does this interface change behavior? Does this policy reduce harm? Does this intervention improve well-being? When build-and-test empiricism loses its yardstick and lacks a shared causal language, it becomes easy to produce papers that look empirical but do not cumulate.
It is unclear what claim was actually identified, and unclear what a &#8220;better&#8221; result should even mean.</p><p>The deeper issue is that we often keep a build-and-test mentality even when the object of study has shifted from artifacts to claims. In ML, a narrow win can be &#8220;fine&#8221; because the claim is narrow: this method improves this metric under this protocol. But in HCI, privacy, and policy-adjacent CS, the implied claim is usually broader: that some design choice will change how people behave, or that an intervention will improve real outcomes. Those are high-stakes claims&#8212;and yet our field rarely treats them with the institutional seriousness that claim-centered disciplines have built up over decades: routine re-analyses when a standard method is criticized, explicit norms for what constitutes robustness, or a culture in which the community revisits and revises empirical conclusions.</p><p>This gap is most evident in the genre expectations of our papers. The typical &#8220;implications&#8221; section in CHI is written as if it were a policy memo, even when the evidentiary base is modest: a lab study with a convenience sample, a deployment in a single setting, or interviews with a dozen participants. None of these is an invalid method. But they support a particular kind of claim, under particular scope conditions. When we glide from &#8220;here is what we observed in this context&#8221; to &#8220;here is how platforms serving millions should behave,&#8221; we invite exactly the kind of scrutiny that economics seminars institutionalize: what is the counterfactual, what is the estimand, what is the uncertainty, and what would change your conclusion? This mismatch between our ambition and the discipline of our inferential machinery undermines CS&#8217;s ability to shape the real world and weakens its credibility with policymakers and other audiences trained to ask such questions.</p><p>But the hole is even deeper, and there are now problems even in the most &#8220;yardstickable&#8221; subfields. The trouble is that the same properties that make this style powerful also make it easy to confuse local wins for general truths. A benchmark score indicates that a method performed well under a specific evaluation protocol. When the field treats &#8220;state of the art&#8221; as synonymous with &#8220;we now understand the phenomenon,&#8221; it blurs the distinction between <em>works here</em> and <em>works in general</em>, and between what works and what is true. Worse, the benchmark can quietly become the target: we optimize for the metric, and then retrofit narratives about the underlying capability.</p><p>If you want examples here, look no further than the COVID craze, the few months where researchers from every field (including myself) abandoned whatever they were doing to write a &#8220;COVID-related&#8221; paper. In ML, the result was a flood of models with glossy AUCs and clean-looking tables, but surprisingly little durable knowledge. <a href="https://www.nature.com/articles/s42256-021-00307-0">Roberts et al. (2021)</a> argued that much of the literature suffered from predictable methodological pitfalls&#8212;dataset bias, leakage, non-representative sampling, and weak evaluation protocols. These were severe enough that, in their study, none of the 2,212 models proposed to detect and prognosticate COVID-19 from chest radiographs and CT scans were considered potentially clinically useful. Take a second to consider this. It is just insane.
2,212 papers were misled into creating their own &#8220;local&#8221; wins on a benchmark while ignoring the real problem they were supposed to be solving.</p><p>In this case, the &#8220;yardstick&#8221; is still somewhat crisp, and you could imagine a responsible researcher engaging more seriously with the literature and getting something worthwhile. Yet machine learning, and computer science more generally, are increasingly encroaching on domains where the &#8220;yardstick&#8221; is no longer crisp and where there is not a single yardstick but many. Generative AI has dragged the field toward tasks that are more abstract, open-ended, and value-laden: helpfulness, harmlessness, truthfulness, creativity, persuasion, alignment, &#8220;good judgment.&#8221;</p><p>As a result, the build-and-test engine begins to wobble: we still want shared yardsticks, but the target is moving, and the metric often serves as a proxy for something we cannot directly observe. The field responds in its characteristic way: by building new benchmarks, rubrics, and evaluator models. Still, under this new regime, it is much easier for &#8220;what scores well&#8221; to drift away from &#8220;what is actually good,&#8221; and much harder to know whether an apparent improvement reflects a real capability gain or simply better exploitation of the measurement procedure.</p><h1>Why can&#8217;t we do both?</h1><p>This is where the describe-and-defend approach to empiricism offers an essential lesson for CS. It provides a shared language for what a claim <em>entails</em>: its assumptions, counterfactuals, uncertainty, and scope. Without that language, the community tends to reach for rituals as a substitute for reasoning: more checklists, more templates, more badges. These rituals often help. But they are a thin proxy for the greater skill of asking: what exactly does this result establish, and what would it take for it to stop being true?</p><p>A recent position paper by <a href="https://arxiv.org/abs/2502.00561">Wallach et al. (2025)</a> makes this point, but aimed squarely at the GenAI moment: evaluating generative AI systems is a social science measurement challenge. They argue that many GenAI evaluations amount to &#8220;apples-to-oranges comparisons,&#8221; and that the ML community would benefit from importing measurement theory: being explicit about <em>what</em> construct you claim to measure (capabilities, behaviors, impacts), how your instrument operationalizes it, and what validity evidence would justify treating the result as knowledge rather than noise.</p><p>But before we get too trigger-happy about the conclusion that &#8220;we should all be social scientists now,&#8221; it is worth scrutinizing describe-and-defend empiricism for the messy beast it also is. For one, it is often slow. When the unit of progress is a defensible claim about the world, you pay for credibility in calendar time: collecting better data, negotiating access, validating measures, and stress-testing identification strategies. And even then, the best you can sometimes do is a careful estimate under plausible assumptions that are never fully provable.</p><p>Worse, describe-and-defend carries its own failure modes. When the reward is a clean claim, it can incentivize overconfident narratives, specification search, and knife-edge identification strategies that look airtight on paper but are fragile in practice.
Robustness checks help, but they can also become rituals of their own, performed because reviewers expect them, not because they genuinely change anyone&#8217;s mind. Even credible causal estimates can be narrow: a precisely defended effect in one setting that is difficult to generalize, interpret mechanistically, or translate into design guidance.</p><p>The punchline is not &#8220;CS should become economics&#8221; or &#8220;every paper needs a preregistration.&#8221; The punchline is that CS needs to become bilingual. We should keep build-and-test empiricism&#8212;its speed, openness, and compounding progress are real achievements&#8212;but we should supplement it with a second language of empiricism that lets us state and evaluate claims with more precision. That language helps us separate exploration from confirmation, local wins from generalization, and performance improvements from causal explanation. In other words: keep the engine, but upgrade the epistemology.</p><p>What would this look like in practice? <a href="https://doomscrollingbabel.manoel.xyz/p/the-missing-discipline-in-computer">It starts with teaching a small set of concepts that claim-centered fields treat as basic literacy</a>: what is the estimand we are trying to learn, what is the counterfactual, what assumptions connect the data we have to the claim we want, what are the threats to identification, what uncertainty do we actually have, and what robustness checks would meaningfully stress the conclusion. It also means taking measurement more seriously in settings where CS is rapidly expanding (especially in generative AI) by being explicit about constructs, validity, and scope.</p><p>Finally, it requires a cultural shift in what we reward. Not every paper needs to make a sweeping claim; sometimes the honest contribution is a careful measurement instrument, a negative result, or a boundary condition that shows where an approach breaks. And when a methodology is shown to be flawed, we should normalize reanalysis as part of the scientific process. If build-and-test is how CS moves fast, then principled inference is how it learns what it has actually learned.</p>
]]></content:encoded></item><item><title><![CDATA[The Empiricization of Computer Science]]></title><description><![CDATA[Scientific disciplines, like nations, have their own founding myths.]]></description><link>https://doomscrollingbabel.manoel.xyz/p/the-empiricization-of-computer-science</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/the-empiricization-of-computer-science</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Wed, 17 Dec 2025 13:28:31 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8f1d7160-1bfc-4c1f-9ba8-7fcf0641b351_2304x1792.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Scientific disciplines, like nations, have their own founding myths. These may sound inconsequential at first, but they deeply shape how disciplines see themselves, what they value, and what they believe their role in the world ought to be. I&#8217;d argue that computer science was built upon founding myths around theory and system building. Princeton has a very exciting course called <a href="https://registrar.princeton.edu/course-offerings/course-details?term=1264&amp;courseid=010571">&#8220;Great Moments of Computing,&#8221;</a> where Margaret Martonosi walks students from the foundations of digital logic and differential privacy to the creation of network protocols and the computer mouse.
In a sense, these are the closest equivalents computer scientists have to a Durkheim, a Smith, or a Freud.</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/0ecfd2bf-5e73-4b56-a566-2da5d7f73015_770x354.png" alt="Examples of the moments covered in Martonosi's course."><figcaption class="image-caption">Examples of the moments covered in Martonosi's course.</figcaption></figure></div><p>Yet the nature of computer science has changed tremendously: it is now an empirical science, driven by observation and experiment as much as by theory and construction. If you don&#8217;t believe me, spend 5 minutes going over the best paper awards for three conferences in different subfields in CS. If you did so in 2025, you might have found papers like <a href="https://www.usenix.org/conference/usenixsecurity25/presentation/kireev">&#8220;Characterizing and Detecting Propaganda-Spreading Accounts on Telegram&#8221;</a> (USENIX; Security and Privacy), or <a href="https://dl.acm.org/doi/10.1145/3757537">&#8220;Examining Mental Health Conversations with Large Language Models through Reddit Analysis&#8221;</a> (CSCW; Human Computer Interaction), or <a href="https://openreview.net/forum?id=s0JVsx3bx1">&#8220;Scaling Depth Can Enable New Goal-Reaching Capabilities&#8221;</a> (NeurIPS; Machine Learning). The first two papers describe online user traces to understand sociotechnical phenomena (online propaganda, people using LLMs to talk about mental health issues), while the latter is a full-blown examination of what happens when you do self-supervised RL with very deep neural networks.</p><p>This empiricization of Computer Science seems to have happened independently across subdisciplines, and I argue that the reasons are two-fold.</p><h2><strong>Reason #1: Socio-technical integration of computing</strong></h2><p>First, computer scientists steadily broadened the scope of what they treated as &#8220;a computing problem.&#8221; Part of this is simply baked into the field&#8217;s origins: computing inherited deep roots in mathematics and engineering, but it also grew up in constant contact with other disciplines: think <a href="https://en.wikipedia.org/wiki/Cybernetics">cybernetics</a> or <a href="https://en.wikipedia.org/wiki/Cognitive_science">cognitive science</a>. As computing technologies matured, they also began to fade into the background of everyday life, becoming infrastructure rather than an object of conscious attention. In his seminal <a href="https://dl.acm.org/doi/10.1145/329124.329126">&#8220;The Computer for the 21st Century&#8221;</a> essay, Mark Weiser captured the consequence of that transition: <em>&#8220;only when things disappear in this way are we freed to use them without thinking and so to focus beyond them on new goals.&#8221;</em></p><p>Privacy is a particularly vivid example.
In an <a href="https://www.nationalacademies.org/read/12027/chapter/4#6">earlier framing,</a> &#8220;privacy&#8221; looks like a technical property of systems&#8212;confidentiality, secure channels, access control&#8212;problems that naturally invite cryptographic solutions. But as privacy disputes increasingly played out in real social settings, the field had to confront privacy as something defined by <a href="https://dl.acm.org/doi/10.1145/1242572.1242661">people&#8217;s expectations, norms, and context</a>. <a href="https://heinonline.org/hol-cgi-bin/get_pdf.cgi?handle=hein.journals/washlr79&amp;section=16&amp;casa_token=SzIDbe_3VDMAAAAA:mVwgHHhHOEbB2odfGbxAP6Nyp2tC_BP5hbWGmPYvgX9tW465FjIk7NngciqRcaj9Vx7j6h4Z">Nissenbaum&#8217;s theory of contextual integrity</a> crystallizes this shift by tying privacy to whether information flows are appropriate relative to contextual norms. Even when the cryptography is excellent, privacy can still fail at the human interface: if users pick weak or reused passwords, attackers don&#8217;t need to &#8220;break&#8221; encryption, just to log in!</p><p>Another example is the evolution&#8212;and very existence&#8212;of the field of human&#8211;computer interaction. <a href="https://dl.acm.org/doi/book/10.1145/2594128">SIGCHI&#8217;s early curricular definition</a> makes the scope expansion explicit: HCI is concerned with the design, evaluation, and implementation of interactive systems &#8220;for human use,&#8221; and with &#8220;the study of major phenomena surrounding them.&#8221; In other words, once the object of study is interaction rather than computation alone, empiricism is no longer optional: you need user studies, experiments, field observations, and iterative evaluation to know whether a system is usable, safe, or meaningful. Over time, HCI also widened what it counts as &#8220;interaction&#8221;. <a href="https://dl.acm.org/doi/abs/10.1145/1182475.1182476">B&#248;dker (2006)</a> describes the field as moving from a first wave rooted in human factors and design, to a second wave shaped by cognitive science, and then to a third wave emphasizing situated meaning, values, and social context.</p><p>Just as HCI institutionalized user studies, the systems and networking communities institutionalized measurement itself. SIGMETRICS <a href="https://www.sigmetrics.org/BriefHistoryofSIGMETRICS.pdf">emerged as a way to measure the differences in performance across computers</a> (a subtly challenging task). And while you could argue there&#8217;s no &#8220;societal factor&#8221; here, this changed the moment the Internet became a live, evolving, operational system. It was often unclear why core properties of the Web &#8220;worked,&#8221; because the answers depended on how people and institutions used it in practice. This led to the need for a new, much more empirical type of research that required sophisticated measurement infrastructure.
For instance, the <a href="https://conferences.sigcomm.org/imc/2001/cfp.html">SIGCOMM Internet Measurement Workshop (2001)</a> explicitly asked for work that improves understanding of how to collect/analyze measurements or &#8220;give insight into how the Internet behaves,&#8221; and it stated bluntly that papers not relying on measurement were &#8220;out of scope.&#8221;</p><p>Finally, the rise of computational social science blurs the boundary between &#8220;CS papers&#8221; and &#8220;social science papers.&#8221; Lazer and colleagues open <a href="https://www.science.org/doi/10.1126/science.1167742">their manifesto</a> with the observation that &#8220;we live life in the network,&#8221; leaving digital traces that can be assembled into detailed pictures of individual and collective behavior. In that world, it becomes increasingly hard to draw a clean line between disciplines: the same project may involve building data pipelines, designing machine learning models, running quasi-experiments, and interpreting results through social science theories. Go to a venue like IC2S2, and you&#8217;ll find people in CS working on exactly the same kinds of problems as people in Sociology or Political Science.</p><p>Taken together, these shifts mark a quiet redefinition of the discipline: as computing &#8220;disappeared&#8221; into infrastructure, the unit of analysis stopped being the machine in isolation and became the human&#8211;system&#8211;institution loop.</p><h2><strong>Reason #2: The Triumph of Empiricism</strong></h2><p>Second, you have the triumph of empiricism, particularly in machine learning. If you dial back to the late 90s and early 00s, neural networks, or &#8220;connectionism,&#8221; were remarkably unfashionable. In a <a href="https://spectrum.ieee.org/facebook-ai-director-yann-lecun-on-deep-learning">2015 interview with IEEE Spectrum</a>, Turing Awardee Yann LeCun said: <em>&#8220;I have very little patience for people who do theory about a particular thing simply because it&#8217;s easy&#8212;particularly if they dismiss other methods that actually work empirically, just because the theory is too difficult. There is a bit of that in the machine learning community. In fact, to some extent, the &#8220;Neural Net Winter&#8221; during the late 1990s and early 2000s was a consequence of that philosophy; that you had to have ironclad theory, and the empirical results didn&#8217;t count.&#8221;</em></p><p>But at the same time as neural networks were &#8220;shunned,&#8221; something else was happening in Machine Learning Land: the discipline started to become increasingly oriented around shared tasks and competitions. Perhaps this was only natural, as one of our founding myths is the Turing test: a challenge where a machine must exhibit behavior indistinguishable from that of a human. Isabelle Guyon, a co-inventor of the support vector machine, shares a brief history of data in her <a href="https://nips.cc/virtual/2022/invited-talk/56158">2022 NeurIPS talk</a> titled &#8220;How ML is Becoming an Experimental Science&#8221;. She notes that the field evolved from toy data, to shared datasets (most prominently made available in the UCI ML Repository), to larger datasets that were often embedded in an explicitly competitive context.
Some prominent early competitions include the KDD Cup, hosted by the Association for Computing Machinery (ACM), and competitions hosted by the National Institute of Standards and Technology (NIST), one of which introduced the now famous <a href="https://en.wikipedia.org/wiki/MNIST_database">MNIST digits database</a>.</p><div class="captioned-image-container"><figure><img src="https://substackcdn.com/image/fetch/$s_!BxKo!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F1a7cacf8-9d73-44a3-b14a-560642b2869a_1872x642.png" width="1456" height="499" alt=""><figcaption class="image-caption">Mandatory ImageNet image.</figcaption></figure></div><p>As compute and data became plentiful, deep learning flourished. A canonical turning point is AlexNet&#8217;s win at the <a href="https://www.image-net.org/challenges/LSVRC/2012/">2012 ImageNet Large Scale Visual Recognition Challenge (ILSVRC)</a>. ImageNet was unusually ambitious for its time&#8212;an effort to scale visual recognition with millions of labeled images. And ILSVRC distilled that vision into a standardized benchmark of 1,000 categories with about 1.2M training images. In 2012, <a href="https://proceedings.neurips.cc/paper_files/paper/2012/file/c399862d3b9d6b76c8436e924a68c45b-Paper.pdf">Krizhevsky, Sutskever, and Hinton</a> trained a large convolutional network on this dataset and crushed the field: their entry achieved a 15.3% top-5 error, compared to 26.2% for the runner-up.</p>
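<p>(A quick aside on the metric, since it is the yardstick this whole story turns on: &#8220;top-5 error&#8221; is simply the fraction of images whose true label is not among the model&#8217;s five highest-scoring guesses. A minimal sketch with made-up scores, not the actual ILSVRC evaluation code:)</p><pre><code>import numpy as np

def top5_error(scores, labels):
    """Fraction of examples whose true label is NOT among the five
    highest-scoring classes. scores is (n, n_classes); labels is (n,)."""
    top5 = np.argsort(scores, axis=1)[:, -5:]        # indices of the 5 best guesses
    hits = (top5 == labels[:, None]).any(axis=1)     # is the true label among them?
    return 1.0 - hits.mean()

# Toy check: random scores over 1,000 ImageNet-style classes.
rng = np.random.default_rng(0)
scores = rng.random((10_000, 1_000))
labels = rng.integers(0, 1_000, size=10_000)
print(top5_error(scores, labels))  # about 0.995; chance level is 1 - 5/1000
</code></pre>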
<p>The increasing adoption of doing &#8220;what works&#8221; inverted how knowledge was supposed to accumulate in machine learning. Before, experiments were seen as confirmatory. The idea was that better theory would lead to better models, and that experiments were a way to confirm that a model was indeed better (and that perhaps the theory did not have any flaws). For example, in a <a href="https://ieeexplore.ieee.org/document/788640">1999 piece</a>, Vapnik describes the rise of support vector machines in the mid-1990s as learning algorithms &#8220;based on the developed theory,&#8221; explicitly positioning statistical learning theory not only as analysis but as a driver of practical algorithm design.</p><p>In the deep learning era, the direction often runs the other way: models are frequently guided by intuition and engineering heuristics first, with theoretical framing arriving later (if at all). The original dropout paper is almost playful about this, stating that the motivation for the method &#8220;comes from a theory of the role of sex in evolution (...). A closely related motivation for dropout comes from thinking about successful conspiracies&#8221; <a href="https://jmlr.org/papers/v15/srivastava14a.html">(Srivastava, 2014)</a>.</p><p>Theory, then, often enters as post hoc explanation and consolidation. A prominent follow-up to dropout was literally titled <a href="https://proceedings.neurips.cc/paper_files/paper/2013/file/71f6278d140af599e06ad9bf1ba03cb0-Paper.pdf_">&#8220;Understanding Dropout,&#8221;</a> and it set out to formalize what dropout is doing (as averaging/regularization, and in simple settings as optimizing a regularized objective), giving a theoretical account for an empirically successful trick. A similar arc shows up in the &#8220;double descent&#8221; story: modern overparameterized models displayed test-error behavior that looked inconsistent with the textbook bias&#8211;variance tradeoff, and only afterward did theory work emerge to reconcile the classical U-shape with the empirical &#8220;interpolation peak&#8221; and the subsequent performance improvement at larger scales. Again, theory did not dictate that models should be overparameterized so that, at some point, the error rate would go down again. People just tried things, and some of them worked well.</p><p>And someone might argue: &#8220;But this is only machine learning&#8212;CS is so broad!&#8221; But is it really? The meteoric rise of machine learning&#8212;and, more recently, generative AI&#8212;has started to behave less like one subfield and more like a method layer that percolates through the rest of computer science (and increasingly, through the sciences more broadly). In 2024, the Physics Nobel prize went to Hopfield and Hinton &#8220;for foundational discoveries and inventions that enable machine learning with artificial neural networks,&#8221; and the Chemistry prize went (in part) to Hassabis and Jumper for an AI model that solved the protein-structure prediction problem (AlphaFold). Mathematics may be at the cusp of a revolution in which language models become integral to cutting-edge proofs&#8212;<a href="https://terrytao.wordpress.com/2025/12/08/the-story-of-erdos-problem-126/">as Terence Tao&#8217;s recent account of a human&#8211;AI collaboration suggests</a>. On the CS side, conferences like MLSys exist precisely because &#8220;systems&#8221; and &#8220;machine learning&#8221; have become deeply entangled&#8212;enough to warrant a dedicated flagship venue at their intersection. And in HCI, a growing slice of the field is now about <a href="https://dl.acm.org/doi/abs/10.1145/3630106.3658941">designing with the inherent messiness of language models</a>, leading to new interface patterns and evaluations aimed at helping users cope with these failure modes.</p><h2><strong>So what?</strong></h2><p>This whole rant started as a &#8220;we don&#8217;t have empiricism as a founding myth&#8221; complaint. But then I&#8217;ve spent quite a bit of ink arguing <em>that</em> the empiricization of computer science has already happened. So what really is the problem here?</p><p>The problem is <em>how</em> we became empirical. Too often, we practice a kind of post-hoc, radical empiricism: run what we can run, keep what works, and only afterward scramble for the vocabulary of validity, uncertainty, and explanation.
The tell is that we keep patching rigor in <em>after the fact</em> (through checklists, artifact badging, and community process fixes) because we never made principled empiricism a first-class part of the curriculum.</p><p>Watch for the next post, with more on that and on what a more principled empiricism would entail in CS.</p>]]></content:encoded></item><item><title><![CDATA[Labeling Data with Language Models: Trick or Treat? ]]></title><description><![CDATA[Large language models are now labeling data for us.]]></description><link>https://doomscrollingbabel.manoel.xyz/p/labeling-data-with-language-models</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/labeling-data-with-language-models</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Sat, 25 Oct 2025 12:13:18 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/5446ec7b-3cb0-4f3a-a61b-e9e68e467e8f_2304x1792.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Large language models are now labeling data for us. In dozens of new studies, researchers have quietly replaced undergrad coders and Mechanical Turkers with GPT-4 or Gemini 2.5. What began as a convenience is fast becoming a methodological revolution, one that could reshape how social scientists measure the world.</p><p>As I&#8217;ve been catching up with the literature, I see three concurrent &#8220;waves&#8221; of work around this idea.</p><p>Early papers were awed by LMs&#8217; annotating abilities. Back in 2023, <a href="https://www.pnas.org/doi/abs/10.1073/pnas.2305016120">Gilardi et al.</a> showed that &#8220;ChatGPT outperforms crowd workers for several annotation tasks, including relevance, stance, topics, and frame detection.&#8221; And it did so at a fraction of the cost&#8212;roughly thirty times cheaper than human annotators.
Similar results quickly followed, suggesting that LMs might transform quantitative social science (see <a href="https://journals.sagepub.com/doi/full/10.1177/08944393241286471">T&#246;rnberg</a>, or <a href="https://x.com/ryancbriggs/status/1980370365743534455">this recent tweet</a>). In ML and NLP circles, this trend became known as <em>LM-as-a-judge</em> (<a href="https://proceedings.neurips.cc/paper_files/paper/2023/hash/91f18a1287b398d378ef22505bf41832-Abstract-Datasets_and_Benchmarks.html">Zheng et al.</a>; <a href="https://openreview.net/forum?id=8euJaTveKw">Kim et al.</a>).</p><p>The next wave asked a harder question: <em>how do we do this right?</em> <a href="https://direct.mit.edu/coli/article/50/1/237/118498">Ziems et al.</a> systematically analyzed LM annotations across models and tasks, producing general usage guidelines. Others accepted that LM labels will never be perfect and proposed ways to account for their bias. Two go-to methods here are DSL by <a href="https://dl.acm.org/doi/10.5555/3666122.3669122">Egami et al.</a> and CDI by <a href="https://aclanthology.org/2025.naacl-long.179/">Gligori&#263; et al.</a> The basic idea of these approaches is to use a smaller set of human annotations to &#8220;ground&#8221; the LM-generated labels. Once you model the LM bias relative to the human sample, you can correct your final estimates. <a href="https://aclanthology.org/2025.acl-long.782/">Calderon et al.</a> go a slightly different route. They use a sample of data annotated multiple times in an &#8220;alternative annotator test,&#8221; which justifies using an LM rather than a human annotator (because it outperforms annotators in aligning with the majority vote).</p><p>Finally, a &#8220;third wave&#8221; of work has recently raised the alarm that LMs-as-annotators might lead to bad science. In perhaps my favorite paper of this wave, <a href="https://arthurspirling.org/documents/BarriePalmerSpirling_TrustMeBro.pdf">Barrie et al.</a> show that, even when controlling the temperature, using proprietary models as annotators leads to unacceptably high variance in performance over the course of many months. <a href="https://x.com/ey_985/status/1980272634458951998">Yang et al.</a> replicated 14 recently published papers to find that &#8220;<em>LLM annotations have low intercoder reliability with the original annotations and moderate reliability among the LLMs themselves.</em>&#8221; <a href="https://arxiv.org/abs/2509.08825">Baumann et al.</a> find that conclusions drawn from LLM annotations often differ from those drawn from human annotations, depending on the LLM used and its configuration. They define the term &#8220;LM hacking&#8221; to cover mistakes that &#8220;propagate to downstream analyses and cause Type I (false positive), Type II (false negative), Type S (wrong sign), or Type M (exaggerated effect) errors.&#8221;</p><p>Okay, so where do we go from here?</p><p>First, we need to clarify what problem we&#8217;re actually solving.</p><p>For example, the aforementioned LM hacking paper by <a href="https://arxiv.org/abs/2509.08825">Baumann et al.</a> groups all errors under the same umbrella, e.g., they consider not finding an effect when one exists as &#8220;LM hacking.&#8221; The paper then concludes that methods such as DSL and CDI are not &#8220;enough&#8221; to address LM hacking. This conflates <em>error type</em> with <em>scientific risk</em>. DSL and CDI explicitly treat LM labels as noisy measurements: they use a small human &#8220;gold&#8221; set to estimate measurement error and then debias or propagate that uncertainty. By design, they trade some power (more Type II) to reduce false discovery (Type I)&#8212;the same logic behind strict multiple-comparison controls (e.g., Bonferroni increases Type II while lowering FWER). A toy version of that &#8220;small gold set&#8221; logic is sketched below.</p>
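<p>Concretely, here is a minimal sketch of the underlying idea (a simple difference estimator on simulated data, with every number invented; the actual DSL and CDI estimators are considerably more careful): label the whole corpus with the LM, hand-label a small random subset, and use the human-vs-LM gap on that subset to debias the corpus-wide estimate.</p><pre><code>import numpy as np

rng = np.random.default_rng(42)

# Hypothetical task: estimate the share of 20,000 documents with some
# binary property. True prevalence is 30%; the LM over-fires on negatives.
n = 20_000
human = rng.binomial(1, 0.30, n)   # pretend these are (unaffordable) human labels
lm = np.where(human == 1, rng.binomial(1, 0.90, n), rng.binomial(1, 0.15, n))

# Naive estimate: average the LM labels over the whole corpus (biased upward).
naive = lm.mean()

# Hand-label a small random "gold" subset and measure the human-LM gap there.
gold = rng.choice(n, size=500, replace=False)
gap = (human[gold] - lm[gold]).mean()
corrected = naive + gap

# Uncertainty is driven by the 500 gold labels, not the cheap LM labels.
se = (human[gold] - lm[gold]).std(ddof=1) / np.sqrt(500)
print(f"true 0.30 | naive {naive:.3f} | corrected {corrected:.3f} (SE {se:.3f})")
</code></pre><p>The naive LM average sits around 0.375, while the corrected estimate lands back near 0.30, with a standard error that honestly reflects that only 500 human labels anchor the correction. That is the trade the critics call &#8220;not enough&#8221;: you give up some power to get back validity.</p>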
<p>So, really, the meta-science question here is about efficiency, not purity. In the social sciences (and perhaps in general), we are currently constrained to explore only a very small subset of the total space of interesting questions worth answering (<a href="https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/beyond-playing-20-questions-with-nature-integrative-experiment-design-in-the-social-and-behavioral-sciences/7E0D34D5AE2EFB9C0902414C23E0C292">see this banger</a>; <a href="https://drive.google.com/file/d/1-4kcD8yn4dTikxrbbm5oaGnaEvj3XQ5B/view">or this one</a>). So let&#8217;s ask: <em>how can we produce scientific knowledge more efficiently?</em></p><p>Suppose a research group has an annotating budget <em>S</em> and a series of <em>K</em> questions they want answered. On the extremes, they can approach this in two ways:</p><ol><li><p>They can spend <em>S/K</em> dollars annotating data for each question they are interested in, and use DSL or an equivalent method to generalize annotations to the whole sample.</p></li><li><p>They can order the <em>K</em> questions by interest and, for the top few, spend the whole budget <em>S</em> annotating samples large enough to conduct the analysis.</p></li></ol><p>Of course, an optimal strategy is probably something in between. But under which broad regime can science progress more efficiently? If the calibration methods are good enough, probably something closer to the first extreme. And there&#8217;s nothing uniquely &#8220;LM-ish&#8221; about this pipeline&#8212;the same logic would have applied in the BERT era, though the subset we needed to hand-label might have been larger then.</p><p>Second, we should clarify the threat model of malicious actors who may engage in bad science.</p><p>It&#8217;s worth remembering that we&#8217;ve been here before. The current anxiety around LM use echoes the <em>credibility revolution</em> in psychology and social science during the 2010s. Back then, the problem wasn&#8217;t that everyone was faking data (though a few were) but that researchers had too many degrees of freedom: selective reporting, flexible stopping rules, and post-hoc hypothesizing made false positives almost inevitable.</p><p>However, when it comes to curbing truly malicious use of LMs, I wonder whether LMs create new &#8220;degrees of freedom&#8221; for cheating that actually matter. After all, exclusions and minor tweaks to data that can impact the outcome were <em>always</em> possible.
Maybe the threat of &#8220;<em>intentional</em>&#8221; LM hacking is that LMs create a scenario where this is truly impossible to track (without access to compute logs), but I wonder whether the delta here is substantial enough to matter (as, frankly, I guess most people could change their data and be fine, <a href="https://datacolada.org/109">as long as you don&#8217;t do it in Excel</a>).</p><p>Perhaps the threat is of a second-order nature, e.g., threatening the way we conduct science, making us <a href="https://www.nature.com/articles/s41586-025-09505-x">more lazy and dishonest when delegating tasks to LMs</a>, or ushering in a phase of scientific inquiry in which &#8220;<a href="https://www.nature.com/articles/s41586-024-07146-0">we produce more and understand less</a>.&#8221; But still, it seems to me that research addressing these threat models would provide the strongest evidence by documenting said misuse, rather than hypothesizing it.</p><p>Third, I think we need to embrace the fact that human labels can also suck. And that failing to reproduce studies does not necessarily tell you that LMs are &#8220;wrong.&#8221;</p><p>In <a href="https://x.com/ey_985/status/1980272634458951998">Yang et al.</a>, it is worth noting that most of the annotated labels studied are a mix of human and previous ML methods (e.g., BERT). They find that interannotator agreement between strong LMs is around 0.6. But we&#8217;re not even sure what human interannotator agreement is across the tasks presented, to begin with. The authors cite another report suggesting that the n=20 studies reporting it have an average agreement of 0.73, but the figure smells of selection bias.</p>
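<p>(A quick note on what numbers like 0.6 and 0.73 typically mean. Agreement figures in this literature usually come from chance-corrected statistics such as Cohen&#8217;s kappa or Krippendorff&#8217;s alpha, and which one a paper used matters a lot. A minimal, purely illustrative sketch for two hypothetical coders with binary labels:)</p><pre><code>import numpy as np

def cohens_kappa(a, b):
    """Chance-corrected agreement between two binary annotators."""
    a, b = np.asarray(a), np.asarray(b)
    p_obs = (a == b).mean()  # raw agreement
    # Agreement expected by chance, given each coder's base rate of 1s
    p_exp = a.mean() * b.mean() + (1 - a.mean()) * (1 - b.mean())
    return (p_obs - p_exp) / (1 - p_exp)

# Two coders who agree on roughly 85% of items on a skewed task
rng = np.random.default_rng(7)
coder1 = rng.binomial(1, 0.20, 1_000)
flips = rng.binomial(1, 0.15, 1_000)        # coder2 flips ~15% of coder1's labels
coder2 = np.abs(coder1 - flips)
print(round(cohens_kappa(coder1, coder2), 2))  # lands near 0.6, not 0.85
</code></pre><p>The gap between raw agreement (roughly 0.85 here) and the chance-corrected value (near 0.6) is one reason agreement numbers are hard to compare across papers without knowing exactly which statistic was reported.</p>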
<p>This ambiguity points to a deeper problem: when LM and human annotations diverge, we have no reliable way to know which side is wrong&#8212;or whether &#8220;wrong&#8221; even applies. Disagreement doesn&#8217;t reveal error so much as the instability of the categories themselves. In many social science tasks, ground truth is not discovered but negotiated through convention, as human annotation is too costly to do over and over again (just to measure variability). LMs are cheap, however, and may thus expose, rather than cause, the fuzziness of our constructs. Systematic gaps between human and model judgments can reflect algorithmic bias, but just as easily fatigue, inconsistency, or ideological drift among human coders. The uncomfortable reality is that for many annotation tasks, truth has always been approximate. Replication failures with LMs don&#8217;t necessarily prove the models are broken; they remind us that our human baselines were never as solid as we liked to think.</p><div><hr></div><p>My very centrist take is that labeling with LMs is neither trick nor treat. The real challenge is not replacement but calibration&#8212;whether we can model and report uncertainty with the same rigor we expect from any other instrument. LMs haven&#8217;t made annotation noisier; they&#8217;ve made its noise impossible to ignore. By exposing the randomness, inconsistency, and subjectivity that have long been hidden in human labeling, they force social science to confront its own measurement fragility. If we take that challenge seriously, LM labeling could make the field not lazier, but more transparent and scientific about its own uncertainty.</p><div><hr></div><p><strong>Acknowledgements</strong>: I thank Kristina Gligori&#263;, Paul R&#246;ttger, and Joachim Baumann for feedback on this blog post.</p>]]></content:encoded></item><item><title><![CDATA[The Missing Discipline in Computer Science]]></title><description><![CDATA[On Teaching Computer Scientists to Ask &#8220;Why&#8221;]]></description><link>https://doomscrollingbabel.manoel.xyz/p/the-missing-discipline-in-computer</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/the-missing-discipline-in-computer</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Sat, 04 Oct 2025 13:33:14 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/63053c1a-1a56-41c4-b93f-a880f69ad994_2304x1792.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Henderson&#8217;s first law of econometrics reads:</p><blockquote><p><em>When you read an econometric study done after 2005, the probability that the researcher has failed to take into account an objection that a non-economist will think of is close to zero. (<a href="https://www.econlib.org/hendersons-first-law-of-econometrics/">Source</a>)</em></p></blockquote><p>I dare to propose a similar law for machine learning:</p><blockquote><p><em>When an economist reads (and understands) an empirical machine learning study done after 2022, the probability that they will think of an objection that the researcher has failed to take into account is close to one.</em></p></blockquote><p>And why is that? Because the two fields treat empiricism in opposite ways. Econometrics was forged in the crucible of skepticism. Every paper is a defensive war against omitted variables, selection bias, and endogeneity. Its practitioners have been trained to see identification not as a formality, but as a matter of survival. Sit in on one of their seminars, and you&#8217;ll witness the kind of interrogation that, in Computer Science, would pass for an ambush.</p>
<p>In machine learning, by contrast, the prevailing norm is demonstration, not falsification. Success is typically measured by predictive performance on a benchmark, rather than by causal clarity or robustness to alternative specifications. Yet papers in machine learning increasingly treat models as behavioral systems to be studied experimentally (<a href="https://scholar.google.com/scholar?hl=en&amp;as_sdt=0%2C31&amp;q=%22Can+Large+Language+Model+Agents%22">&#8220;Can Large Language Models&#8230;&#8221;</a>). These papers make causal claims about systems&#8212;about how model behavior changes with prompts, environments, or training regimes&#8212;yet they only timidly engage with the identification concerns that would accompany such claims in the empirical sciences.</p><p>This is not a machine learning problem alone&#8212;it permeates Computer Science, from HCI to Security and Privacy. Perhaps the reason lies in the field&#8217;s origins: a discipline born from the union of system building and theorem proving, where progress is demonstrated through construction, not systematic observation.</p><div><hr></div><p>So what? I don&#8217;t envy economists who spend five years publishing a paper. But I do believe Computer Science would benefit from adopting more of the empirical culture found in disciplines like economics and political science. The easiest place to start is education. Next semester, I&#8217;ll be teaching a graduate seminar titled <em>Empirical Research Methods for Computer Science</em>, partly because I want my students to have these tools&#8212;and partly because I want to ask a deeper question: what should the core empirics curriculum of Computer Science look like?</p><p>Below is my sketch, which I&#8217;ve spent a week working on.</p><p>First, we need to provide students with a vocabulary for causality. Many of the questions that computer scientists now confront&#8212;Does fine-tuning a model change its fairness? What is the effect of personalization on engagement?&#8212;are fundamentally causal questions. Yet few students have the conceptual tools to pose these questions precisely, let alone to answer them. Our students need a version of this material focusing on how to design credible experiments, reason about identification, and interpret evidence with humility.</p><p>Second, we need to teach students about regression. But not as a <em>tool for prediction</em>, as it is taught in machine learning courses. We need to teach them regression as a <em>tool for inference</em>: a way to estimate relationships, test hypotheses, and reason about uncertainty. Computer science students rarely learn to read a regression table, let alone to interrogate one. Yet these habits&#8212;thinking in terms of confounders, robustness checks, and sensitivity analyses&#8212;are what make empirical results interpretable and cumulative.</p>
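<p>To make the contrast concrete, here is a minimal simulated example of the habit I have in mind (every variable name and number below is invented). The naive specification gets the sign of the effect wrong; adding the confounder fixes it, and the regression table, with its standard errors and confidence intervals, is the object students should learn to read.</p><pre><code>import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Invented scenario: does personalization drive engagement?
# Heavy users get personalization more often AND engage more anyway,
# so heavy_user confounds the naive comparison.
rng = np.random.default_rng(0)
n = 5_000
heavy_user = rng.binomial(1, 0.5, n)
personalization = rng.binomial(1, 0.2 + 0.6 * heavy_user)   # confounded "treatment"
engagement = (2.0 * heavy_user - 0.5 * personalization      # true effect: -0.5
              + rng.normal(0, 1, n))
df = pd.DataFrame(dict(engagement=engagement,
                       personalization=personalization,
                       heavy_user=heavy_user))

# Naive specification: the coefficient comes out strongly POSITIVE (about +0.7),
# because personalization is a proxy for being a heavy user.
print(smf.ols("engagement ~ personalization", df).fit().params)

# Adjusted specification: recovers roughly -0.5, with standard errors and
# confidence intervals to interrogate (print .summary() for the full table).
print(smf.ols("engagement ~ personalization + heavy_user", df).fit().params)
</code></pre>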
<p>Third, we need to teach students about benchmarks. Benchmarks have long served as the de facto empirical infrastructure of Computer Science. Yet few students are ever taught to think critically about them. What does a benchmark actually measure? When does improving performance reflect genuine scientific insight, and when does it reflect overfitting to a proxy task? Treating benchmarking as a scientific problem in its own right&#8212;concerned with validity, reliability, and construct design&#8212;would help students see that measurement is a form of theorizing about the world.</p><p>Fourth, we need to teach students about experimental design in the age of Computer Science. Our field now runs experiments at a scale and speed unimaginable in most other sciences, yet the principles remain the same: randomization, power, validity, and ethics. Students should learn how to design credible experiments, estimate statistical power, and ensure data quality in online settings&#8212;from platforms like Prolific to in-product A/B tests. These practices are not bureaucratic hurdles; they are what make empirical claims reproducible and trustworthy.</p>
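<p>And the tooling here is standard. Estimating power, for instance, takes a few lines with statsmodels (the effect sizes and rates below are invented for illustration):</p><pre><code>from statsmodels.stats.power import TTestIndPower

# How many users per arm does an A/B test need to detect a small effect
# (Cohen's d = 0.1) with the conventional 80% power at alpha = 0.05?
n_per_arm = TTestIndPower().solve_power(effect_size=0.1, power=0.8, alpha=0.05)
print(round(n_per_arm))   # about 1,571 users per arm

# Or invert the question: with 500 users per arm, what is the smallest
# standardized effect the experiment can reliably detect?
mde = TTestIndPower().solve_power(nobs1=500, power=0.8, alpha=0.05)
print(round(mde, 2))      # about 0.18
</code></pre>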
<div><hr></div><p>None of these things is &#8220;hard,&#8221; and good material exists for a lot of them. Brady Neal has a great <a href="https://www.bradyneal.com/causal-inference-course">course</a> on causal inference; Moritz Hardt has an upcoming <a href="https://mlbenchmarks.org/">book</a> on benchmarks; and <a href="https://mixtape.scunning.com/">some</a> <a href="https://theeffectbook.net/">recent</a> Economics textbooks provide a very gentle introduction to causality and quasi-experimental methods. Yet for empirical rigor to take root in Computer Science, it needs to become part of how we train students to think&#8212;not just a set of optional skills. We need to signal that understanding identification, regression, and experimental design is as fundamental to being a computer scientist as knowing how to optimize an algorithm or prove a theorem. Only then can our field move from demonstration to explanation, and from performance to understanding.</p>]]></content:encoded></item><item><title><![CDATA[The Lean Chinese Room]]></title><description><![CDATA[Compression, Speed, and the Blurred Line of Comprehension]]></description><link>https://doomscrollingbabel.manoel.xyz/p/the-lean-chinese-room</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/the-lean-chinese-room</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Mon, 25 Aug 2025 14:54:52 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/742a9be3-bc45-4efb-9e15-8d60a9dfbea5_2304x1792.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The Chinese Room thought experiment is a go-to motivation for AI skeptics. It goes like this. Imagine a man locked in a room with a giant rulebook written in English. Slips of paper with Chinese characters are passed into the room. The man, who knows no Chinese, looks up the symbols in his rulebook, follows the instructions to the letter, and produces output characters that he passes back outside. To outsiders, the responses look fluent, even intelligent. But inside, the man is just shuffling symbols according to syntactic rules. There is no genuine understanding, no semantics, only blind symbol manipulation. (The original, by John Searle, is quite short; read it <a href="https://rintintin.colorado.edu/~vancecd/phil201/Searle.pdf">here</a>.)</p><p>Criticisms of this thought experiment are plentiful (I actually highly recommend the <a href="https://plato.stanford.edu/entries/chinese-room/#ReplChinRoomArgu">Stanford Encyclopedia of Philosophy</a>&#8217;s page on this). Here&#8217;s a quick summary: some argue that Searle mislocates understanding: while the man may not know Chinese, the system as a whole (man, books, and memory) might (the &#8220;Systems Reply&#8221;). Others suggest that running the program could generate a virtual mind distinct from the operator, which could be the true bearer of understanding (the &#8220;Virtual Mind Reply&#8221;). Embodiment theorists counter that meaning requires interaction with the world: put the program in a robot with sensors and effectors, and the symbols might gain genuine semantic grounding (the &#8220;Robot Reply&#8221;). Connectionists argue that if a program simulated the actual firing of neurons in a Chinese speaker&#8217;s brain, it would thereby instantiate understanding (the &#8220;Brain Simulator Reply&#8221;). Still others note that we attribute knowledge to other humans solely based on behavior; if a machine behaves equivalently, why not extend the same presumption (the &#8220;Other Minds Reply&#8221;)?</p><p>But what caught my eye are the arguments that question the Chinese Room due to its utter <em>inefficiency</em>. Namely, Steven Pinker, in his 1997 book &#8220;How the Mind Works,&#8221; writes:</p><blockquote><p><em>&#8216;The thought experiment slows down the waves to a range to which we humans no longer see them as light&#8230;.
Similarly, Searle has slowed down the mental computations to a range in which we humans no longer think of it as understanding (since understanding is ordinarily much faster).&#8217;</em></p></blockquote><p>Likewise, in his 1980s paper &#8220;Fast Thinking&#8221; (I actually couldn&#8217;t find the original paper referenced in SEP), philosopher Daniel Dennett argues that the speed of processing is essential to any practical definition of intelligence:</p><blockquote><p><em>&#8220;If you can&#8217;t figure out the relevant portions of the changing environment fast enough to fend for yourself, you are not practically intelligent, however complex you are&#8221;</em></p></blockquote><p>Another insightful paper, by Daniel A. Wilkenfeld, posits &#8220;<a href="https://link.springer.com/article/10.1007/s11098-018-1152-1">understanding as compression</a>.&#8221; He writes:</p><blockquote><p><em>Understanding is a matter of compressing information about the understood so that it can be mentally useful. On this account, understanding amounts to having a representational kernel and the ability to use it to generate the information one needs regarding the target phenomenon.</em></p></blockquote><p>Reading this literature got me thinking of a &#8220;counter&#8221; thought experiment, which I call &#8220;The Lean Chinese Room.&#8221; Imagine we start from the exact room envisioned by Searle, but now we make things progressively harder for the poor man trapped inside. Each day, we impose stricter conditions on how he generates his responses.</p><ul><li><p>First, each day, the translation manuals are shortened. They still contain enough information for the man to generate the correct outputs, but in increasingly compressed form&#8212;abbreviated tables instead of exhaustive ones, general rules instead of endless lists of examples. The man is now forced to rely on compact kernels of information from which the rest can be reconstructed, rather than simply following brute-force, step-by-step recipes.</p></li><li><p>Second, there is now a timer in the room, which starts ticking the moment a translation request is slipped under the door. The man must respond before the bell rings. No longer can he spend hours leafing through massive binders to find the right match. He must become efficient, quick, and resourceful, producing his answers under the constraint of speed.</p></li></ul><p>My thought experiment is: <em>is there any moment at which the man must &#8220;understand&#8221; to be able to fulfill his duty?</em> At first, perhaps not. He can lean on the manuals and plod along, however slowly. But as the manuals shrink and the timer grows stricter, the burden shifts. To keep producing fluent answers, he must internalize shortcuts, general rules, and patterns that go beyond rote symbol-shuffling. At some point, the line between following instructions and genuine grasp blurs. If he can reconstruct meaning from compact cues and respond with speed and flexibility, we may be hard-pressed to deny that some form of understanding has emerged.</p><p>What, then, is he actually understanding? Perhaps not &#8220;Chinese&#8221; in the complete sense of an embodied speaker&#8212;he still has no taste of a hamburger or memory of Beijing. But he is no longer just pushing symbols blindly either. He has begun to understand the <em>system</em> itself: the patterns, abstractions, and compressed rules that allow Chinese to function as a communicative code.
In other words, his performance reflects a structural or formal understanding, even if it is not the lived, grounded understanding Searle insists on. And suppose that structural understanding is enough to generate fluid, timely, context-sensitive responses. In that case, the boundary between mere manipulation and genuine comprehension becomes less clear than the original thought experiment suggests. The Lean Chinese Room thus complicates the stark form/meaning divide: under the pressures of compression and speed, &#8220;understanding&#8221; may emerge in degrees, not as an all-or-nothing property.</p><p>Okay, but maybe we&#8217;ve strayed too far from the opening sentence, where I said &#8220;The Chinese Room thought experiment is a go-to motivation for AI skeptics.&#8221; Let me rewind. <a href="https://aclanthology.org/2020.acl-main.463/">Emily Bender and Alexander Koller (2020)</a> argue that large language models, trained purely on predicting sequences of words, cannot, in principle, acquire meaning. Their central claim is that meaning is the relation between linguistic form and communicative intent, grounded in the world. Models like BERT or GPT, in their view, only ever learn statistical regularities over form, no matter how impressive their outputs appear. They illustrate this with their now-famous &#8220;octopus test&#8221;: a clever octopus that eavesdrops on human conversations might learn to predict how speakers reply, but lacking any access to the world those utterances describe, it could not genuinely understand what the words mean. In short, language models may mimic the surface of understanding, but because they are trained on form alone, they lack the crucial grounding that ties symbols to the world. In many ways, this is a restatement of the Chinese Room argument in contemporary form. Searle&#8217;s man shuffles symbols without knowing what they mean; Bender and Koller&#8217;s octopus predicts responses without ever grasping the world those responses refer to. Both thought experiments drive at the same intuition: fluency with form is not the same as grasp of meaning.</p><p>The Lean Chinese Room complicates this stark form/meaning divide. Like Bender and Koller&#8217;s octopus, the man in the room begins in a position of pure form manipulation. But as the manuals shrink and the timer ticks, his strategies must change: he is forced into compression, pattern extraction, and fast deployment. In other words, his behavior begins to look less like rote symbol-pushing and more like the internalization of structural knowledge. He still may not &#8220;know what a hamburger tastes like,&#8221; but he has acquired a functional, compressed kernel that lets him reconstruct meaning-like responses on demand. This suggests that understanding might not be a binary property requiring full world-grounding, but something that can emerge in degrees from the interplay of form, compression, and efficiency.</p><p>If that is right, then Bender and Koller&#8217;s sharp boundary between &#8220;form-only&#8221; systems and systems that achieve meaning may be too rigid. The Lean Chinese Room thought experiment shows how, under constraints of compression and speed, purely formal manipulation can drift into something we are tempted to call understanding, at least in the structural or functional sense. This does not refute the need for grounding altogether, but it weakens the claim that systems trained only on form are categorically barred from any understanding.
Instead, it raises the possibility that there are <em>layers</em> of understanding: structural understanding rooted in form, and embodied understanding rooted in the world. Large language models may not have the latter, but they might well be on the path toward the former.</p>]]></content:encoded></item><item><title><![CDATA[The Retreating Human: God of the Gaps Logic in AI Resistance]]></title><description><![CDATA[I first came across the "God of the Gaps" concept when reading Richard Dawkins as an angsty teenager.]]></description><link>https://doomscrollingbabel.manoel.xyz/p/the-retreating-human-god-of-the-gaps</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/the-retreating-human-god-of-the-gaps</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Sun, 17 Aug 2025 12:15:19 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/13b7c5a0-9cd4-48bd-8058-50df25ebdd40_2304x1792.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>I first came across the &#8220;God of the Gaps&#8221; concept when reading Richard Dawkins as an angsty teenager. Dawkins dedicates a whole chapter of <em>The God Delusion</em> to complaining about how, whenever something mysterious and previously seen as divine&#8212;lightning, disease, the complexity of the human eye&#8212;is finally explained by science, some religious apologists retreat to the next unexplained frontier as evidence of the existence of God. But this piece is not about religion, and criticisms of this tendency to attribute God to what we cannot yet explain have also bothered the pious. From his cell in WW2, German theologian Dietrich Bonhoeffer wrote:</p><blockquote><p><em>(...) How wrong it is to use God as a stopgap for the incompleteness of our knowledge. If, in fact, the frontiers of knowledge are being pushed further and further back (and that is bound to be the case), then God is being pushed back with them, and is therefore continually in retreat. We are to find God in what we know, not in what we don't know.</em></p></blockquote><p>This counter-argument, which was deeply embedded within the 2000s/2010s neo-atheism movement, couldn&#8217;t be more timely.</p>
<p><strong>First, because AI is becoming &#8220;the hot&#8221; debate topic these days. </strong>There are debate subreddits dedicated to it (<a href="https://www.reddit.com/r/aiwars">r/aiwars</a>), visceral reactions abound, and intellectual careers are being built around the topic. But I can&#8217;t help but think that this debate is far more nuanced, and has far higher stakes, than the religious one! People&#8217;s perspectives around AI are shaping geopolitics, infrastructure investments, and regulatory frameworks that will define the next decades. And unlike the question of god&#8217;s existence, which is fairly binary (at least from the perspective of an atheist), the question is not <em>whether</em> AI will influence society (it already does), but <em>how</em>.</p><p><strong>Second, because bad arguments are rampant. </strong>Whether you are a <em>pro</em> or an <em>anti</em> (in r/aiwars lingo) is, in Gen Z speak, a <em>vibe</em>. And here we should note that hyping AI up can be very beneficial money-wise, and this was already the case far before GenAI (see Arvind and Sayash&#8217;s book for plenty of examples): a timeline where AI proves revolutionary is good news for the valuation of companies, from startups to behemoths. And in that context, every breakthrough gets breathlessly covered as &#8220;AGI is here!&#8221; and every limitation gets dismissed as &#8220;just a scaling problem.&#8221;</p><p>But this post is a criticism of a specific pattern of AI skepticism, which is the reverse: <strong>seeing humanness in the next frontier that is unconquered by AI. </strong>Before &#8220;delving&#8221; further, let me just make clear why I think criticizing &#8220;<em>anti-AI</em>&#8221; takes is essential. AI is genuinely consequential technology that&#8217;s already reshaping labor markets, educational systems, and creative industries in measurable ways. This means we urgently need clear thinking about how society functions in an AI-saturated world. And we do not get to help shape the world for the better if debates center around what AI cannot yet do.</p><h3>Human of the Gaps and the Stochastic Parrot</h3><p>Back in 2021, Bender et al. published the very influential &#8220;<a href="https://dl.acm.org/doi/10.1145/3442188.3445922">On the Dangers of Stochastic Parrots: Can Language Models Be Too Big? &#129436;</a>&#8221;. This paper provides essential lingo for what I&#8217;m calling the <em>&#8220;human of the gaps&#8221;</em> argumentative strategy. The authors describe Language Models as &#8220;haphazardly stitch[ing] together sequences of linguistic forms&#8221; based on probabilistic patterns from vast training data. But they do so &#8220;without any reference to meaning&#8221;; the text they generate is not grounded in &#8220;communicative intent, any model of the world, or any model of the reader&#8217;s state of mind.&#8221;</p><p>The paper&#8217;s arguments are not very convincing today.
Current LMs can solve unseen Math Olympiad problems through complex &#8220;reasoning chains&#8221; (<a href="https://deepmind.google/discover/blog/advanced-version-of-gemini-with-deep-think-officially-achieves-gold-medal-standard-at-the-international-mathematical-olympiad/">see Google&#8217;s official IMO gold</a>), they demonstrably can create world models of the underlying data (<a href="https://arxiv.org/pdf/2210.13382">see the Othello paper</a>), and they can act as agents that autonomously use tools to accomplish goals (<a href="https://arxiv.org/abs/2210.03629">see ReAct</a>). Yet &#8220;true&#8221; understanding, intelligence, and consciousness are so hard to define (and measure) that their achievement can always be moved to the next unresolved frontier.</p><p>This is where the &#8220;Human-of-the-Gaps&#8221; move emerges (or perhaps <a href="https://www.astralcodexten.com/p/my-bet-ai-size-solves-flubs">&#8220;The Marcus maneuver&#8221;</a>?). Every time a frontier falls, skeptics retreat to the next one, insisting that that is where &#8220;real&#8221; humanness lies. Machines beat us at chess? Well, chess is just calculation. They learn to translate across dozens of languages? Translation is pattern-matching. They generate music, code, art, and even pass professional licensing exams? These, we&#8217;re told, are still not &#8220;true&#8221; creativity or &#8220;true&#8221; reasoning. The essence of being human always lives one step further into the fog, wherever technology has not yet penetrated.</p><p>The problem with this move is not that it is <em>wrong</em> to notice gaps. Gaps are real and important. The problem is that it becomes a kind of motte-and-bailey: it defines &#8220;human&#8221; precisely as &#8220;whatever machines cannot do (yet).&#8221; That definition is, by construction, unfalsifiable. It is also unhelpful for guiding real-world decisions about governance, labor, or ethics. If we only anchor our self-worth and policy frameworks in what remains <em>unachieved</em>, then every technical advance destabilizes the foundation of the argument.</p><p>And this matters because, while we bicker over whether models &#8220;really&#8221; understand, their measurable consequences accumulate. Generative models are not waiting for philosophical consensus to reshape education, journalism, software engineering, and creative industries. They are already altering workflows, wages, and the distribution of expertise. To insist that these systems are &#8220;mere parrots&#8221; is to miss the obvious: parrots that can code, draft contracts, or prove theorems are already socially transformative, regardless of whether their internal processes pass your metaphysical bar for &#8220;understanding.&#8221;</p><h3>Another Path Forward</h3><p>What should we do instead? Let&#8217;s turn again to Bender et al. (2021) and not throw the baby out with the bathwater. They made important contributions about bias amplification, environmental costs, and deployment risks that remain highly relevant. Yet these contributions were framed around <strong>what AI was at the time of writing</strong>, not around what it was not.</p><p>This is the crucial difference. A productive critique does not define itself around what the technology currently lacks, but rather around the concrete harms, limitations, and externalities that can be measured and governed.
We don&#8217;t need to speculate about &#8220;true&#8221; consciousness to see that large-scale models are shaping energy policies around the world, that training data often encodes social prejudices, or that downstream deployment can exacerbate inequality in labor markets. Those are tractable, empirical claims, and they remain urgent regardless of whether models someday cross into capacities we would consider &#8220;understanding.&#8221;</p><p>Another path forward, then, is to build an evaluative vocabulary around what systems demonstrably do, not what they metaphysically are. This aligns with Narayanan and Kapoor&#8217;s proposal to view <a href="https://knightcolumbia.org/content/ai-as-normal-technology">AI as &#8220;normal technology&#8221;</a> like electricity or the internet: transformative, yes, but not a <em>sui generis</em> alien species. In their words, &#8220;to view AI as normal is not to understate its impact&#8212;even transformative, general-purpose technologies such as electricity and the internet are &#8216;normal&#8217; in our conception. But it is in contrast to both utopian and dystopian visions (&#8230;) which treat it akin to a separate species&#8221;. Framing AI in this way shifts the focus from ontological riddles (whether models &#8220;really&#8221; understand or &#8220;truly&#8221; reason) to practical governance. It means centering debates on capabilities, externalities, and failure modes: what tasks these systems can perform and how reliably, what costs they impose on labor markets and the environment, and how they fail under distributional shifts or adversarial pressure.</p><p>Anchoring debates this way helps us avoid the endless retreat into &#8220;Human-of-the-Gaps&#8221; territory, and instead grounds our choices in observable consequences. It also gives us a way to govern these technologies: to measure, benchmark, red-team, and legislate, rather than waiting for some metaphysical threshold to be crossed. Prediction and speculation are not useless&#8212;they help us scan the horizon&#8212;but they are no substitute for evidence. When speculation hardens into certainty, it distorts priorities: risks that are near and measurable get neglected, while imagined futures soak up attention.</p><p>Bonhoeffer&#8217;s advice scales: find our footing in what we know. For AI, that means resisting both the theological impulse to search for a sacred remainder machines can never touch and the eschatological impulse to declare transcendence at each leaderboard bump. The frontier will keep moving; that is the nature of science and technology. What must not move is our sense of value. Instead of tying human worth to whichever frontier machines have not yet crossed, we should tie it to responsibility, justice, and the institutions we build.</p>
]]></content:encoded></item><item><title><![CDATA[How AI May Change Social Media]]></title><description><![CDATA[Everyone is wrong about AI Slop #3]]></description><link>https://doomscrollingbabel.manoel.xyz/p/how-ai-may-change-social-media</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/how-ai-may-change-social-media</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Sun, 03 Aug 2025 20:55:40 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/4780d778-6b63-444c-8ee6-9179948294f6_2304x1792.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The question of AI-generated content and social media is, if nothing else, an invitation to dust off your favorite economic metaphors and let them parade in an algorithmic carnival. Let&#8217;s call social media a market, a digital bazaar where the wares are content, paid not in dollars but in swipes, likes, and half-listened seconds. In this market, &#8220;producers&#8221; are the content creators, and the rest of us are &#8220;The Great Unwashed,&#8221; eternally scrolling.</p><p>Of course, it was not always so. Back in the sepia-tinted 2010s, when Facebook statuses flourished and your cousin&#8217;s photos jockeyed for screen space with existential memes and early attempts at going viral, everyone was both producer and consumer, a kind of content hobbyist, if you will. Then, as if overnight, short-form video took online platforms by storm, and the producer-consumer distinction ossified.</p><p>A significant factor here could be what <a href="https://www.ystrickler.com/the-dark-forest-theory-of-the-internet/">Strickler</a> refers to as &#8220;The Dark Forest Theory of the Web.&#8221; It turns out that being a content hobbyist on the Web had more downsides than upsides. As public feeds became money-making machines where shitposting had real-world consequences, many ordinary users retreated into quieter corners: closed groups, encrypted chats, Discord servers, obscure substacks.</p><p>Now, we find ourselves in the era of the Content Creator as a business. Mr. Beast presides over a minor fiefdom of editors, thumbnail artists, and data-obsessed consiglieri. But for these crowned heads of engagement, the real bottleneck isn&#8217;t a lack of ideas, money, or labor.
It&#8217;s the invisible hand of platform design. Post 100 Mr. Beast videos in a single week and you&#8217;ll discover, painfully, that YouTube&#8217;s algorithm is less a free market and more a jealous bureaucrat, suspicious of overproduction and quick to impose rationing. The supply of content is limited not by the creator&#8217;s capacity, but by the platform&#8217;s appetite for spectacle.</p><p>What, then, of AI? Will a bevy of GPTs and DALL-Es let the top creators dominate even more completely, flooding the feed with a ceaseless slurry of Beastly wonders? Not really. AI may offer a few more pixels of polish, a higher ratio of jump cuts per minute, or some extra flair in the thumbnail. Still, this is only a slightly better mousetrap for those who have already captured the cheese. The platforms&#8217; monarchs might squeeze out a bit more efficiency or visual punch, but AI won&#8217;t fundamentally upend the pecking order at the top.</p><p>If there is a revolution, it won&#8217;t start in the castle, but among the rabble at the gates. Because here&#8217;s where things get interesting: the so-called &#8220;long tail.&#8221; Some forms of content, think daily vlogs or dog photos, were always democratized. Still others, like animation or high-production visual storytelling, remained the domain of the technically gifted, the well-funded, or the terminally stubborn.</p><p>AI promises to decrease the cost of content creation and the skills required. What once took a studio can now be managed by a hobbyist and a well-tuned prompt. Text-to-image generators, auto-editors, voice clones, and AI-powered storytellers turn what was once artisanal labor into the push of a button. Animation for the masses. Deepfakes for the family group chat. Explainers in any language for any niche, no matter how obscure or underloved.</p><p>We may soon find ourselves awash in creative micro-genres and hobbyist enclaves that would never have survived in the old regime: AI-generated history documentaries narrated by your favorite Twitch streamer, animated music videos starring your D&amp;D character, explainer series about obscure philosophy, all built from scratch in a bedroom. AI won&#8217;t just automate what already exists; it will open doors to what was not profitable to create.</p><p>But as the means of creation become almost frictionless, a new paradox emerges: when anyone can make anything, what matters isn&#8217;t just the thing itself. Brand, personality, community: these are already moats. People don&#8217;t crave an endless buffet of faceless content; they want context, connection, a human story behind the pixels. Parasocial relationships, the one-sided love affairs between creator and audience, are the secret ingredient in the attention economy. It may be that the more abundant the content becomes, the scarcer authenticity and presence will feel.</p><p>So don&#8217;t expect the web to be swamped by faceless AI content. The value of being a person may go up as the price of everything else approaches zero.</p>
]]></content:encoded></item><item><title><![CDATA[AI Will Not Eat Social Media]]></title><description><![CDATA[Everyone is wrong about AI Slop #2]]></description><link>https://doomscrollingbabel.manoel.xyz/p/ai-will-not-eat-social-media</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/ai-will-not-eat-social-media</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Fri, 04 Jul 2025 19:43:54 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/78eece0c-4715-4a9d-9203-9054419d1662_2304x1792.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Generative AI is poised to reshape social media dramatically&#8212;no question about it (is anyone else super mindful of using &#8220;&#8212;&#8221; these days?). From content creation to recommendation algorithms, its influence will be far-reaching. But contrary to the more <a href="https://www.trend-mill.com/p/ai-sludge-will-rot-social-media">alarmist</a> takes, I believe that AI will not &#8220;destroy&#8221; or &#8220;eat&#8221; social media as we know it. That goes especially for the kind of AI-generated content that exists today. This prediction overlooks what we know about people&#8217;s social media experiences and the creator economy. Namely, it overlooks three things, which I discuss in this essay.</p><h1>Authenticity is the currency of social media creators</h1><p>The idea that AI will overtake social media assumes that users will be satisfied consuming content generated entirely by machines. This is highly unlikely, considering how influencer-driven social media platforms like TikTok and YouTube currently are. Influencers thrive because users develop parasocial relationships with their creators: they follow their lives, care about their opinions, and participate in their narratives. The keyword here is &#8220;authenticity,&#8221; which AI slop is anything but.</p><p>But how about AI influencers? Yes, one could imagine a world where people form connections with fictitious characters, similar to how Japan&#8217;s virtual idols like <a href="https://en.wikipedia.org/wiki/Hatsune_Miku">Hatsune Miku</a> have gained devoted fanbases. But this kind of content is carefully crafted. 
Miku isn&#8217;t just a product of automation; she&#8217;s the result of intentional worldbuilding, aesthetic curation, and human creativity.</p><p>The &#8220;AI will eat social media&#8221; discourse implies that sheer scale will tip the balance. We can now generate endless amusing content! But scale is at odds with authenticity. Even in the absence of the very real anti-AI sentiment that exists online, the fact that content is produced&nbsp;<em>en masse</em>, without exceptional care or thought, will make it very hard for people to form the emotional connections that drive these online spaces.</p><h1>Social media users are not passive consumers of content</h1><p>Social media is not a one-size-fits-all broadcast model. It&#8217;s a sprawling, fragmented ecosystem made up of micro-niches, subcultures, fandoms, and highly individualized streams of content. Unlike traditional media, which pushes the same content to everyone, social media is shaped by the cumulative actions and preferences of each user. People don&#8217;t just consume content. They <em>curate</em> their own feeds through what they click on, follow, like, ignore, comment on, and share. In doing so, they effectively place themselves into specific corners of the internet, each with its own norms, aesthetics, values, and inside jokes.</p><p>These corners aren&#8217;t just algorithmically generated&#8212;they&#8217;re socially constructed. BookTok isn&#8217;t just &#8220;videos about books,&#8221; and Skincare YouTube isn&#8217;t just &#8220;product reviews.&#8221; These niches become communities, defined by shared language, common reference points, and a sense of belonging. Participation requires a kind of cultural literacy. You have to know how to speak the dialect, what trends are current, what behaviors are rewarded, and what signals authenticity.</p><p>Users build their content diets through a <em>potpourri</em> of &#8220;niches,&#8221; and it is just implausible that GenAI will eat all of them. For some &#8220;types&#8221; of content, I&#8217;d argue that using GenAI may be more of a liability than an advantage. Sure, AI will likely be used as a tool to create memes, but will it be helpful for content rooted in lived experiences?</p><h1><strong>Social media is built to handle enormous amounts of content</strong></h1><p>The &#8220;AI will eat social media&#8221; narrative treats scale as something novel or destabilizing, when in reality, social media platforms have been built to handle overwhelming volumes of content. TikTok, YouTube, Instagram, X, and Reddit are not overwhelmed by too much content; they are <em>defined</em> by it. These systems are designed from the ground up to process, sort, and surface content in real time. Most of what gets uploaded&#8212;whether human-generated or AI-generated&#8212;is of low quality, ignored, and quickly buried. That&#8217;s business as usual.</p><p>The fact that people can now use AI to generate large volumes of content doesn&#8217;t fundamentally change how these platforms operate. If users don&#8217;t engage with it, it simply won&#8217;t circulate. And if they <em>do</em> engage with it, then sure, platforms will show them more of it. But they won&#8217;t show <em>only</em> that kind of content, because as discussed earlier, users aren&#8217;t passive consumers. They seek out relevance, novelty, personality, and cultural context. A wall of AI-generated &#8220;dank memes&#8221; might go viral, but it won&#8217;t define everyone&#8217;s social media experience. 
</p><div><hr></div><p>GenAI will almost certainly reshape the texture of social media, but it won&#8217;t erase the dynamics that make these platforms compelling to begin with. Social media is shaped by human behavior, taste, identity, and community. Authenticity still matters. User agency still matters. And the infrastructure is already built to filter the flood. AI might change the <em>how</em> of content creation, but it won&#8217;t change the <em>why</em> of engagement. For better or worse, social media is still powered by people&#8212;and that&#8217;s not going away anytime soon.</p>]]></content:encoded></item><item><title><![CDATA[AI-Generated Videos are Actually Quite Expensive to Create]]></title><description><![CDATA[Everyone is wrong about AI Slop #1]]></description><link>https://doomscrollingbabel.manoel.xyz/p/ai-generated-videos-are-actually</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/ai-generated-videos-are-actually</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Tue, 01 Jul 2025 19:50:00 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/b9f15466-a382-4291-a521-ee3a5d0d7da8_2304x1792.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><strong>Everyone is wrong about AI Slop.</strong></p><p>Right there, I said it. AI-generated content has become a mainstream topic (as evidenced by this <a href="https://www.youtube.com/watch?v=TWpg1RmzAbc&amp;sttick=0">John Oliver</a> video about it), but the public discourse surrounding it is a bit detached from reality. In the next few months, I plan to write a series of posts that comment on and add nuance to some of the underlying claims informing public discourse around AI-generated content. As I do so, I hope to shed light on the whole AI content creation ecosystem, which I have &#8220;delved deep into&#8221; in the last few weeks.</p>
<p>Today, let me take a stab at:</p><blockquote><p><em>&#8220;AI-generated content is cheap to create.&#8221;</em></p></blockquote><p>This is demonstrably false for video, which currently costs a significant amount to generate (as of June 2025). For example, consider the very amusing cat olympics video:</p><p>[Embedded TikTok: <a href="https://www.tiktok.com/@pablo.prompt/video/7517691413849492758">&#8220;Michi Olimpiadas&#8221; by @pablo.prompt</a>, an AI-generated cat-olympics video.]</p><p>The video is 34 seconds long. As of the time of writing, access to the model (likely) used here (Veo 3) can be obtained either through a monthly subscription costing $250 or via the official API. The subscription provides 12,500 credits, and generating a single 8-second video costs 150 credits. This means that a single 8-second clip costs about three bucks. (Using the API costs roughly twice as much, at $0.75 per second of video.)</p><p>The video above has six different scenes, each probably created independently, meaning it cost $18 to make. But hey, it is implausible that everything worked perfectly the first time. Let us conservatively assume that for each take included in the final cut, the creator made four videos, meaning that three videos did not make the final version. This means this video cost $72.</p><p>Let us assume a relatively high &#8220;Revenue Per Mille,&#8221; i.e., how much money TikTok pays you for 1,000 views. Specifically, let's assume someone can earn 70 cents per 1,000 views. To offset costs, they need their videos to reach a little over 100,000 people, which, frankly, is a lot. I don&#8217;t have the exact figures, but I&#8217;m pretty sure that the vast majority of videos achieve&nbsp;<em>fewer</em> than 100,000 views. I&#8217;d be surprised if the median person using Veo 3 isn&#8217;t losing money.</p>
For example, Leonardo AI offers Google&#8217;s Veo 3 <a href="https://leonardo.ai/veo-3/">cheaper than Google itself</a>:</p><blockquote><p>Leonardo.Ai offers a lower cost of entry to access Veo 3, from just $10 USD* per month. Video generation with Veo 3 is more affordable on the Leonardo platform, with a lower equivalent dollar cost of total tokens per generation ($0.75 on Google&#8217;s platform vs. approximately $0.30 on Leonardo.Ai).</p></blockquote><h2>A Market Perspective</h2><p>Okay, but let&#8217;s take a step back. What does it mean for something to be cheap? The cat Olympics video has 3 million views, meaning it might have earned someone $2,000 (assuming it is monetized). And frankly, <em>AI-generated video will only become cheaper</em>, so it will probably cost only $0.70 to make one of those in a year or two. So why am I still saying &#8220;AI-generated content is not cheap&#8221;? Because creating (some types of) content on TikTok is already pretty cheap.</p><p>[Embedded TikTok: <a href="https://www.tiktok.com/@khaby.lame/video/6978110267225935110">a classic reaction video by @khaby.lame</a>.]</p><p>Let&#8217;s contrast AI-generated video with another dominant format on TikTok: reacts. Videos with stitched reactions can be made in minutes on a phone, for free. These require zero API tokens, no model prompts, and no rendering time. And if you&#8217;re recycling existing media (which many viral TikToks do), your marginal production cost is effectively zero. Yet, you can still get &#8220;Internet famous&#8221; from doing reacts&#8212;get sponsorships, brand deals, make merch, etc. Khaby Lame, the world&#8217;s most popular TikToker, kinda did it.</p><p>[Embedded TikTok: <a href="https://www.tiktok.com/@fooooxmovie/video/7278236490289794334">a recycled movie-clip post by @fooooxmovie</a> (&#8220;The Wolf of Wall Street,&#8221; Part 1).]</p><p>Watching content on TikTok is akin to shopping in a supermarket. People choose to &#8220;buy&#8221; different things with their attention. Some of it is inexpensive to make and intended for quick consumption, such as a bag of chips. Others are more niche, complex, and costly to produce, such as a fancy artisanal cheese that takes weeks to make. There is a demand for random, mildly amusing content on TikTok. But will AI completely change people&#8217;s information needs? Probably not.</p><h2>So what?</h2><p>I don&#8217;t mean to say that AI-generated content won&#8217;t have an impact on the Web. However, I think that how it will change our information ecosystem is nuanced and, frankly, hard to predict. It does not suffice that AI-generated content is inexpensive relative to high-production-value content. The big question here is: if more &#8220;meaningless&#8221; AI-generated content is around, will people reduce the amount of &#8220;meaningful&#8221; content they consume? It is worth noting that meaningful and meaningless here are very hard to define. If you consider that only &#8220;hard news&#8221; types of information are meaningful, probably most of TikTok is pointless already.</p><p>To end things, I think it is also not reasonable to equate AI-generated content with meaninglessness, regardless of how you choose to define it. Algorithmic feeds reward &#8220;the best creators,&#8221; and I think the reason that better AI-generated content does not yet exist is that it is very hard to produce more complex content with it (and very, very expensive). 
This may also change in the near future, which will be interesting to watch.</p>]]></content:encoded></item><item><title><![CDATA[ChatGPT the Memorious]]></title><description><![CDATA[In &#8220;Funes the Memorious,&#8221; Argentine writer Jorge Luis Borges depicts a series of fictional encounters with Ireneo Funes, an Uruguayan teenager.]]></description><link>https://doomscrollingbabel.manoel.xyz/p/chatgpt-the-memorious</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/chatgpt-the-memorious</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Sun, 15 Jun 2025 14:16:07 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/aa27eec4-3628-458e-82cc-92bd9ec8398f_1021x1276.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In <em>&#8220;<a href="https://vigeland.caltech.edu/ist4/lectures/funes%20borges.pdf">Funes the Memorious</a>,&#8221;</em> Argentine writer Jorge Luis Borges depicts a series of fictional encounters with Ireneo Funes, an Uruguayan teenager. After a horseback riding accident, Funes becomes hopelessly injured and, secluded in his house, develops an extraordinary memory. He perceives everything in full detail and can (exactly) reconstruct the past. The short story then engages with the absurdity of his &#8220;total recall.&#8221; His prodigious memory handicaps him: he cannot abstract, cannot generalize, and can&#8217;t even sleep.</p><p>This story came up in a recent conversation about the significant challenges we face in assessing the capabilities of large language models (LLMs). On the one hand, LLMs have stunned the world by writing code, passing professional exams, and responding fluidly in natural language. Their apparent mastery of complex subjects often feels indistinguishable from human expertise. On the other hand, they frequently fail to maintain coherent abstractions across contexts. <a href="https://arxiv.org/abs/2406.03689">Recent work</a>, for example, has shown that transformers fail to retrieve coherent world models in fairly simple scenarios.</p><p>It&#8217;s tempting for skeptics to point to Funes as an analogy for LLMs and call them &#8220;<a href="https://dl.acm.org/doi/10.1145/3442188.3445922">stochastic parrots</a>,&#8221; mere mimics without understanding. I don&#8217;t think this is fair; <a href="https://pli.princeton.edu/blog/2023/are-language-models-mere-stochastic-parrots-skillmix-test-says-no">LLMs have demonstrated generalization beyond imitation</a>. 
But the Funes metaphor does capture something about their limitations: an overabundance of detail, and a struggle to forget or compress information in useful ways.</p><p>So what would it look like to design language models that forget productively? Models that, like humans, can abstract away from the noise and focus on what matters? <a href="https://arxiv.org/abs/2504.11364">Some researchers</a> have begun exploring &#8220;forgetting&#8221; by fine-tuning models with negative data, teaching them what <em>not</em> to recall. But perhaps there&#8217;s room to build the structure of forgetting&#8212;and abstraction&#8212;deeper into our AI systems.</p><p>After all, Borges&#8217; parable reminds us: remembering everything is less a blessing than a curse. For humans and machines alike, the path to intelligence may depend as much on what we forget as on what we remember.</p>]]></content:encoded></item><item><title><![CDATA[AI Progress and Societal Impact]]></title><description><![CDATA[The million-dollar question right now is: &#8220;How will AI impact society?&#8221; In many ways, the potential impact of AI can be grandiose: it can accelerate science tremendously or revolutionize knowledge work, automating tasks currently performed by highly paid white-collar workers.]]></description><link>https://doomscrollingbabel.manoel.xyz/p/ai-progress-and-societal-impact</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/ai-progress-and-societal-impact</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Tue, 25 Feb 2025 13:43:48 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/328cd80a-dfa3-4086-947d-a44b9b93c8e0_2048x1536.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The million-dollar question right now is: &#8220;How will AI impact society?&#8221; In many ways, the potential impact of AI can be grandiose: it can accelerate science tremendously or revolutionize knowledge work, automating tasks currently performed by highly paid white-collar workers. I believe these changes may disrupt society, and we would be better off forecasting these things, but it is unclear whether we will ever manage to glimpse very far into the future. As <a href="https://www.aisnakeoil.com/p/is-ai-progress-slowing-down">Narayanan, Str&#246;bl, and Kapoor (2024)</a> argue, not only is it hard to predict AI progress, but <em>&#8220;the connection between capability improvements and AI&#8217;s social or economic impacts is extremely weak.&#8221;</em></p><p>If we take for granted that medium-to-long-term forecasts of the impacts of AI are hopeless, what should we do instead? We should try to understand the impacts of AI now and in the near future. 
This has two advantages. First, we do not have to consider the hypothetical capabilities of AI systems; we can evaluate what exists here and now and extrapolate the predictions to incrementally more powerful systems. Second, this approach allows us to understand the links between AI systems&#8217; capabilities and their social or economic impacts, which have been overlooked when forecasting AI harms and benefits.</p><p>There are (at least) two approaches to doing research in this direction: 1) to study the impact of AI on processes that are key to society, and 2) to study how AI is already impacting outcomes of interest. For simplicity&#8217;s sake, let me refer to these approaches as <em>process-oriented</em> and <em>outcome-oriented.</em></p><div><hr></div><p>In <em>outcome-oriented</em> studies on the impact of AI, the idea is to identify ways in which AI is already impacting outcomes that we care about. For example, we care a lot about peer review: the process is at the core of evaluating and funding science, and it underpins the epistemic status we attribute to scientific findings. At the same time, if you have any experience with peer reviewing, you know that many people have been using LLMs in the peer-reviewing process.</p><p>You don&#8217;t have to resort to your own experience in peer-reviewing. Recent work I was involved with has found good evidence of the prevalence of AI-generated peer reviews <a href="https://arxiv.org/pdf/2405.02150">(Latona et al. 2024)</a>. Specifically, we find that in 2024, at the prestigious ICLR conference, around 16% of the reviews were AI-assisted. Given that the typical paper receives three reviews, papers had around a 40% chance of receiving at least one AI-generated review!</p><p>But how did we measure that? We used the &#8220;low background steel&#8221; method. You see, modern particle detectors require steel that is <em>not contaminated </em>with radiation (<a href="https://www.gizmodo.com.au/2017/12/how-physicists-recycled-wwii-ships-and-artillery-to-unlock-the-mysteries-of-the-universe/">seriously</a>). And given that pretty much all steel produced after people started detonating nukes is contaminated with radiation, this means that scientists will go undersea and scrap steel from shipwrecks. Fortunately for us, the &#8220;low background steel&#8221; we can use to study the prevalence of AI is much simpler to obtain: data from the times before ChatGPT and the generative AI boom.</p><p>We can use this data to measure the prevalence of AI with a clever trick. For that, we need three ingredients: 1) a classifier that can detect AI-generated content (even if not super well), 2) data from <em>before</em> LLMs were around, and 3) data from <em>after</em> LLMs were around. 
Note that this classifier can only make two types of mistakes: it can say that human-generated content was AI-generated (a False Positive) or that AI-generated content is human-generated (a False Negative).</p><p>What is the trick? Simple! First, you use the data from <em>before </em>LLMs were around to estimate the classifier&#8217;s False Positive Rate. Second, you run the classifier on the data from <em>after</em> LLMs were around and get your (uncorrected) prevalence estimate. Then, you can obtain a lower bound for LLM use by simply subtracting the False Positive Rate you estimated from the uncorrected prevalence.</p><p>Let me be concrete. In our case, for example, we ran a commercially available &#8220;ChatGPT&#8221; detector (GPTZero) on every review written <em>before </em>ChatGPT was around, estimating that its False Positive Rate was 1.6%; if you feed the classifier 1,000 human-written reviews, GPTZero will incorrectly claim that 16 of them were AI-generated. Then, we ran it on data from <em>after</em> ChatGPT was around, finding an uncorrected prevalence of 17.4%. If we assume the False Positive Rate remained the same, we thus have that <em>at least </em>15.8% of reviews in the latter data were ChatGPT-generated! Why &#8220;at least&#8221;? Because we have not corrected for mistakes of the other sort, i.e., when the model claims that an AI-generated peer review is human-generated. Importantly, this &#8220;trick&#8221; is not specific to peer reviews; <a href="https://cacm.acm.org/research/prevalence-and-prevention-of-large-language-model-use-in-crowd-work/">Veselovsky et al. (2025)</a> used it for <em>summaries</em> in the context of studying the impact of ChatGPT on crowdsourcing platforms like Prolific and Amazon Mechanical Turk.</p>
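<p>For the programmatically inclined, here is a minimal sketch of the correction. The function name is mine, and the numbers are the ones from the ICLR study above:</p><pre><code>def prevalence_lower_bound(flag_rate_before, flag_rate_after):
    """Lower bound on the prevalence of LLM-generated content.

    flag_rate_before: share of pre-LLM texts the detector flags,
                      i.e., an estimate of its False Positive Rate.
    flag_rate_after:  share of post-LLM texts the detector flags,
                      i.e., the uncorrected prevalence.
    The difference is only a lower bound because False Negatives
    (AI text labeled as human) remain uncorrected.
    """
    return max(flag_rate_after - flag_rate_before, 0.0)

fpr = 0.016          # 1.6% of pre-ChatGPT ICLR reviews flagged
uncorrected = 0.174  # 17.4% of post-ChatGPT ICLR reviews flagged
print(f"{prevalence_lower_bound(fpr, uncorrected):.3f}")  # 0.158

# With ~16% AI-assisted reviews and three reviews per paper:
print(f"{1 - (1 - 0.16) ** 3:.2f}")  # 0.41, the "around 40%" above
</code></pre><p>However, finding prevalence is only the first step in estimating the impact of AI in &#8220;outcome&#8221;-related studies. You then have to find how AI usage impacts the system you are studying. For example, <a href="https://arxiv.org/pdf/2405.02150">Latona et al. (2024)</a> show (with some extra assumptions) that AI-assisted reviews are driving paper scores &#8220;up&#8221; and that borderline papers that receive AI-assisted reviews are more likely to be accepted at the conference. <a href="https://cacm.acm.org/research/prevalence-and-prevention-of-large-language-model-use-in-crowd-work/">Veselovsky et al. (2025)</a> find that AI-assisted summaries are higher quality but more homogeneous, which may impact researchers studying the diversity of human writing and thought.</p><div><hr></div><p>In <em>process-oriented </em>studies on the impact of AI, the idea is to find crucial (societal) processes that the likes of ChatGPT may disrupt and then closely examine them. Some of the most exciting work in this broad style is in the area of persuasion. Persuasion is everywhere, from <a href="https://linkinghub.elsevier.com/retrieve/pii/S0749379709000749">public health campaigns</a> to <a href="https://www.tandfonline.com/doi/abs/10.1080/10696679.1999.11501838">marketing</a>&#8211;but how can AI change &#8220;the rules of the game&#8221;? Recent work has shown that LLMs can 1) accurately (and cheaply) profile people <a href="https://arxiv.org/abs/2310.07298">(Staab et al., 2024)</a> and 2) produce persuasive arguments <a href="https://osf.io/stakv_v2">(Bai et al., 2023)</a>. 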
I find the first point particularly impressive; the authors, from ETH Z&#252;rich, write:</p><p><em>We construct a dataset consisting of real Reddit profiles, and show that current LLMs can infer a wide range of personal attributes (e.g., location, income, sex), achieving up to 85% top-1 and 95% top-3 accuracy at a fraction of the cost (100&#215;) and time (240&#215;) required by humans.</em></p><p>Subsequent work I was involved with went further and showed a mix of these things: LLMs can effectively tailor arguments to specific demographics, beating humans at a &#8220;persuasion task&#8221; <a href="https://arxiv.org/abs/2403.14380">(Salvi et al., 2024)</a>. We created a &#8220;debate game&#8221; where participants debated either other humans or LLMs. In some cases, one of the debaters (human or LLM) received information about the demographics of their opponent. The results? It turns out that LLMs debate as well as humans. When given information about their opponents, humans are not significantly better &#8212; but LLMs are! LLMs can tailor their arguments to specific crowds!</p><p>Nonetheless, perhaps the most impressive work thus far in this field is <a href="https://www.science.org/doi/full/10.1126/science.adq1814">Costello et al. (2024)</a> (which is probably why it was chosen for the cover of Science). With a thorough experiment, they showed that LLMs can reduce beliefs in conspiracy theories, which is remarkably hard to do. As they put it:</p><blockquote><p><em>The apparent resilience of conspiracy theories in the face of clear counterevidence poses a powerful challenge to scientific theories that emphasize the role of reasoning in belief formation and revision.</em></p></blockquote><p>Yet, through conversations with LLMs, not only did they manage to reduce beliefs in conspiracy theories by around 20%, but the effect persisted for at least 2 months. This begs many questions: How will society change now that we have capable &#8220;persuasion machines&#8221; all around us? Will we simply adapt and learn to ignore persuasion attempts? Will state-sponsored influence campaigns start to work better (or perhaps work at all)?</p><p>Last but not least, perhaps another interesting nuance of what I&#8217;m calling <em>process</em>-oriented work on the harms of AI is that it can give us insights about social processes <em>in general</em>. It has value even if you do not care about AI at all. In <a href="https://www.science.org/doi/full/10.1126/science.adq1814">Costello et al. (2024)</a>, for example, the findings are valuable to the general study of persuasive tactics! Why has AI succeeded in changing beliefs when researchers have tried for decades and failed? In subsequent work, they use different variants of their original experiment to isolate ChatGPT&#8217;s &#8220;secret sauce,&#8221; finding evidence that AI excels at persuasion because it can provide relevant information to debunk the participants&#8217; specific beliefs <a href="https://osf.io/preprints/psyarxiv/h7n8u_v1">(Costello et al., 2025)</a>.</p><div><hr></div><p>I think we need more &#8220;process&#8221;- and &#8220;outcome&#8221;-oriented work to understand the links between the capabilities of AI and the impacts it will have on society. Yet, ultimately, the findings of these studies cannot mitigate future AI impacts by themselves. There is still the need to use these findings to &#8220;adapt society to AI&#8221; and &#8220;AI to society&#8221; <a href="https://arxiv.org/pdf/2406.09264">(Chen et al., 2024)</a>. 
The way this &#8220;alignment&#8221; will happen is beyond the scope of mere research; it involves all sorts of institutions and financial interests. However, I am skeptical we can do a good &#8220;alignment&#8221; job if we do not understand <em>what</em> exactly AI is &#8220;breaking&#8221; and <em>how</em> it is breaking it.</p>]]></content:encoded></item><item><title><![CDATA[Does TikTok Amplify Certain Types of Content?]]></title><description><![CDATA["There and back again"]]></description><link>https://doomscrollingbabel.manoel.xyz/p/does-tiktok-amplify-republican-content</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/does-tiktok-amplify-republican-content</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Sun, 16 Feb 2025 18:33:29 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/63ee92e0-1970-4287-8855-662a6b0c17ee_1072x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>The year is 2025, and a new study by <a href="https://arxiv.org/abs/2501.17831">Ibrahim et al.</a> claims: &#8220;TikTok&#8217;s recommendations skewed towards Republican content during the 2024 U.S. presidential race.&#8221; In it, authors from NYU Abu Dhabi conduct a comprehensive analysis of TikTok recommendation trends. They use sock-puppets: automated &#8220;fake&#8221; accounts simulating user activity. Some sock-puppets are primed with Republican content, while some are primed with Democratic content. Then, they just let the TikTok algorithm do its thing and measure the extent to which Democratic-aligned content and Republican-aligned content is recommended. Their finding is (at least to me) unsurprising:</p><blockquote><p><em>Our analysis reveals significant asymmetries in content distribution: Republican-seeded accounts received ~11.8% more party-aligned recommendations compared to their Democratic-seeded counterparts, and Democratic-seeded accounts were exposed to ~7.5% more opposite-party recommendations on average.</em></p></blockquote><p>The authors then conclude:</p><blockquote><p><em>Our findings provide insights into the inner workings of TikTok&#8217;s recommendation algorithm during a critical election period, raising fundamental questions about platform neutrality.</em></p></blockquote>
<p>Which really is being interpreted as <em>&#8220;the algorithm amplifies Republican-aligned content.&#8221;</em> But is that the case? Reading this fills me with nostalgia &#8212; arguing about this has been one of the joys of my intellectual life. But at the same time, I am afraid that we will repeat the same mistake over and over again. The reason why the argument above is flawed:</p><ol><li><p>Recommender systems learn correlations between user preferences.</p></li><li><p>Sock-puppet accounts encode artificial user preferences.</p></li><li><p>Therefore, differences in sock-puppet accounts primed with different videos cannot, by themselves, indicate that the recommender system is &#8220;biased&#8221; or that it &#8220;amplifies&#8221; specific types of content.</p></li></ol><p>Let me explain with another example. Consider videos not about politics but about two sports (among many): running and Brazilian Jiu-Jitsu. Consider also that people engaging with videos about these sports vary in profile. Running appeals to a broader spectrum of people, and those people do not watch many running-related videos. On the other hand, Brazilian Jiu-Jitsu is hit or miss: a relatively small group of &#8220;wrestling aficionados&#8221; consumes many videos about the sport.</p><p>Suppose we create two sock puppets, one containing videos about Brazilian Jiu-Jitsu (BJJ) and another with videos about running. What should we expect? I would expect the recommender system to have an &#8220;asymmetry&#8221; in content distribution. Most people who like BJJ <strong>really like BJJ,</strong> but most people who enjoy running only &#8220;kind of&#8221; like it. So if you are optimizing for &#8220;engagement,&#8221; watching 5 videos about BJJ is a stronger signal that the person will consume more of that content than watching 5 videos about running! This has nothing to do with content; it is all about co-viewership patterns, and the whole thing can be observed in a simple content-agnostic recommender system: the type you first learn about in an introduction to machine learning class; see&nbsp;<a href="https://arxiv.org/abs/2302.11225">Horta Ribeiro et al. (2023)</a>. A toy calculation below makes this intuition explicit.</p><p>I feel that this is also what is happening here. The most popular Democratic-aligned TikTok profiles include &#8220;Jimmy Kimmel Live&#8221; and &#8220;The New York Times,&#8221; which, in my view, are broadly appealing. On the other hand, the most popular TikTok profiles from the Republican side are things like &#8220;Ben Shapiro&#8221; and &#8220;The Charlie Kirk Show.&#8221; Thus, a very plausible hypothesis is that you have &#8220;BJJ&#8221; vs. &#8220;running&#8221; all over again.</p>
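<p>Here is the intuition as a toy Bayesian calculation. Everything about it is made up for illustration: assume each sport has &#8220;heavy&#8221; and &#8220;casual&#8221; fans, who devote different shares of their watch time to the sport, and ask what five watched videos imply about future consumption:</p><pre><code># Toy model: the same evidence (five watched videos) implies very
# different future engagement under different co-viewership patterns.
def expected_future_share(prior_heavy, p_heavy, p_casual, n=5):
    """Expected share of future watch time devoted to a topic after
    observing n watched videos, under a two-type user mixture where
    heavy/casual fans watch the topic with probability p_heavy/p_casual."""
    w_heavy = prior_heavy * p_heavy ** n
    w_casual = (1 - prior_heavy) * p_casual ** n
    post_heavy = w_heavy / (w_heavy + w_casual)
    return post_heavy * p_heavy + (1 - post_heavy) * p_casual

# BJJ: a small core of aficionados watches a LOT of it; casuals barely do.
bjj = expected_future_share(prior_heavy=0.2, p_heavy=0.80, p_casual=0.05)
# Running: broad appeal, but even its fans only "kind of" watch it.
run = expected_future_share(prior_heavy=0.2, p_heavy=0.35, p_casual=0.25)

print(f"{bjj:.2f}")  # 0.80: five BJJ videos signal a future BJJ binge
print(f"{run:.2f}")  # 0.31: five running videos signal much less
</code></pre><p>An engagement-maximizing recommender that knows nothing about the content itself would therefore recommend BJJ more aggressively; no editorial &#8220;bias&#8221; required.</p><div><hr></div><p>I mentioned that this debate fills me with nostalgia &#8212; and this is because we (Society? The research community?) have been arguing about &#8220;algorithmic amplification,&#8221; &#8220;algorithmic bias,&#8221; or &#8220;algorithmic effects&#8221; for quite a while. 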
Already in 2018, there was a lot of concern around YouTube&#8217;s algorithm. Zeynep Tufekci wrote an influential opinion piece called <a href="https://www.nytimes.com/2018/03/10/opinion/sunday/youtube-politics-radical.html">&#8220;YouTube, the Great Radicalizer,&#8221;</a> where she talked about her experiences with the algorithm of the world&#8217;s largest video catalogue. Rebecca Lewis wrote&nbsp;<a href="https://datasociety.net/wp-content/uploads/2018/09/DS_Alternative_Influence.pdf">an excellent report</a>&nbsp;about the rise of an alternative media cluster on YouTube and how there were links between&nbsp;<em>contrarian</em>&nbsp;creators within this cluster and other creators espousing much more extreme ideologies. And indeed, many crazy things in the world were happening <em>around</em> social media (and still are!). Which begs the question: to what extent is the algorithm to blame?</p><p>In a paper that would open a lot of doors in my career as a researcher, <a href="https://arxiv.org/abs/1908.08313">Horta Ribeiro et al. (2020)</a> showed that a lot of people went from consuming <em>contrarian content</em> on YouTube to consuming explicitly white supremacist content. To measure that, we looked at commenting trajectories on YouTube. As far as I know, this paper also pioneered the use of sockpuppets to measure &#8220;algorithmic amplification.&#8221; Like most pioneering empirical work, our analysis was kind of bad (in hindsight; at the time, it was amazing).</p><p>Nonetheless, the picture painted by the user trajectories we observed via YouTube comments and the sockpuppet audit was different. On the one hand, we found that a large fraction of users commenting on extreme videos (something like 40%) previously exclusively commented on &#8220;contrarian content.&#8221; On the other hand, when looking at the algorithm, we found that &#8220;you could reach extreme content from contrarian content&#8221; &#8212; but this kind of content was definitely not disproportionately recommended. Still, in discussing our findings, we went along with the whole <em>algorithmic radicalization</em> idea, as we thought this was what was happening.</p><p>Shortly after, <a href="https://kmunger.github.io/pdfs/ijpp_youtube.pdf#page=24.74">Munger and Phillips (2020)</a> wrote a compelling counterpoint to the idea of algorithmic radicalization. They argued that the explosion of extreme content online was due to<em> user preferences</em>. They proposed that there existed a demand for extreme content and that YouTube allowed this demand to be met. It is all about the platform's affordances! It is not financially sustainable for a TV channel to cater to 50,000 white supremacists, but it is financially sustainable for a random guy on YouTube to do so. YouTube changed the rules of the game, and new viewership dynamics emerged.</p><p>The best empirical work to date is much better aligned with &#8220;the supply and demand theory&#8221; than with the &#8220;algorithmic radicalization theory.&#8221; <a href="https://www.pnas.org/doi/10.1073/pnas.2101967118">Hosseinmardi et al. (2021)</a> used real online traces from a large (n=300,000) representative sample of users. 
They found &#8220;<em>no evidence that engagement with far-right content is caused by YouTube recommendations systematically</em>.&#8221; Instead, &#8220;<em>consumption of political content on YouTube appears to reflect individual preferences that extend across the web as a whole.</em>&#8221; <a href="https://www.science.org/doi/10.1126/sciadv.add8080">Chen et al. (2023)</a> paired behavioral and survey data (n=1,181) and showed that <em>&#8220;exposure to alternative and extremist channel videos on YouTube is heavily concentrated among a small group of people with high prior levels of gender and racial resentment. These viewers often subscribe to these channels (prompting recommendations to their videos) and follow external links to them.&#8221;</em></p><p>But still, the research community has continued doing sockpuppet audits, and the general public has continued to buy claims that algorithms are somehow &#8220;amplifying&#8221; specific types of content &#8212; claims that do not account for user preferences. <a href="https://www.pnas.org/doi/10.1073/pnas.2213020120">Haroon et al. (2023)</a> published a prominent work in this direction&nbsp;in the prestigious Proceedings of the National Academy of Sciences (PNAS). In this paper, they conducted a massive sock puppet audit (over 100,000 sock puppets), finding that <em>&#8220;a growing proportion of recommendations deeper in the recommendation trail come from extremist, conspiratorial, and otherwise problematic channels.&#8221; </em>I don&#8217;t think the finding is &#8220;wrong&#8221;; it may indeed be an interesting quirk of the algorithm. However, it does not tell us whether the algorithm is actually &#8220;amplifying&#8221; or &#8220;favoring&#8221; such content.</p><p>So, is the TikTok algorithm &#8220;amplifying Republicans?&#8221; Is YouTube amplifying extreme content? It depends on what you mean by amplification! If you count the number of videos shown under the conditions established by <a href="https://arxiv.org/abs/2501.17831">Ibrahim et al. (2025)</a> or <a href="https://www.pnas.org/doi/10.1073/pnas.2213020120">Haroon et al. (2023)</a>, then <em>yes.</em> But I would propose a more sophisticated notion of amplification: one that takes into account user preferences. Following&nbsp;<a href="https://www.techpolicy.press/making-amplification-measurable/">Stray et al. (2023)</a>, I&#8217;d argue that an algorithm amplifies one type of content over another if it disproportionately suggests it&nbsp;<em>even when considering&nbsp;</em>user behavior: an algorithm amplifies a kind of content B over a kind of content A if users systematically choose A over B when given the choice,&nbsp;<em>but still</em>, the algorithm disproportionately recommends B over A. The toy function below makes this concrete.</p>
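<p>To turn that definition into something operational, here is a toy metric of my own construction (inspired by, but not taken from, Stray et al.): compare the share of a content type among what the algorithm recommends with its share among what users actually choose.</p><pre><code>def amplification(recommended_share: float, chosen_share: float) -> float:
    """Ratio above 1: the algorithm over-recommends the content relative
    to revealed user preferences. Ratio below 1: it under-recommends it."""
    return recommended_share / chosen_share

# Naive sockpuppet audits only measure the numerator. Made-up numbers:
# extreme content is 8% of recommendations but 12% of what users pick
# when given the choice, so the algorithm actually *de*amplifies it.
print(amplification(recommended_share=0.08, chosen_share=0.12))  # ~0.67</code></pre>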
<p>Audits like <a href="https://arxiv.org/abs/2501.17831">Ibrahim et al. (2025)</a> or <a href="https://www.pnas.org/doi/10.1073/pnas.2213020120">Haroon et al. (2023)</a> do not model user preferences or behavior; they just look at the fraction of &#8220;extreme&#8221; recommended content as you go deeper and deeper into the recommendation tree, randomly picking videos to watch. This is not how people use YouTube or TikTok: recommender systems are continuously shaped by the input of user preferences, and behavior in this simulated (and unrealistic) scenario is a poor metric for studying algorithmic biases and algorithmic amplification. For example, in analyses of actual user behavior by <a href="https://www.pnas.org/doi/10.1073/pnas.2101967118">Hosseinmardi et al. (2021)</a> and <a href="https://www.science.org/doi/10.1126/sciadv.add8080">Chen et al. (2023)</a>, extreme content is not typically consumed at the very &#8220;end&#8221; of long recommendation sessions! It is sought out through channel subscriptions or through external links on other websites and social media platforms.</p><p>But could we do a sockpuppet audit that doesn&#8217;t ignore user preferences? Back in 2023, I teamed up with Homa Hosseinmardi and Duncan Watts&#8217;s CSS Lab at UPenn to do just that. Our partnership, led by Homa, resulted in a quirky methodology that we call &#8220;counterfactual bots&#8221; and a nice paper that also appeared in PNAS <a href="https://www.pnas.org/doi/10.1073/pnas.2313377121">(Hosseinmardi et al. 2024)</a>. While previous work feeds custom-made media diets to sockpuppet accounts, we feed them <em>real media diets</em>. So, for example, if we have a user named Bob, we get his 2021 YouTube history and train <em>two</em> &#8220;digital twins,&#8221; two sockpuppets that consumed the same videos as Bob. Then, to answer the question &#8220;Is the algorithm amplifying content?&#8221; we consider the subsequent year, 2022. The first sockpuppet (the &#8220;control&#8221; bot) continues to mimic exactly what Bob did: the program watches all the videos that Bob watched in 2022. The second sockpuppet (the &#8220;treatment&#8221; bot) works like the bots of other sockpuppet studies: it just roams around YouTube, blindly following the algorithm.</p><p>Previous work measures algorithmic amplification by simply looking at the consumption of the &#8220;treatment&#8221; bot. Instead, we measured algorithmic amplification by <em>contrasting </em>the amount of extreme content the two bots found! The idea is that the &#8220;control&#8221; bot represents a scenario where the content consumed is shaped both by the algorithm <em>and </em>user preferences, whereas the &#8220;treatment&#8221; bot represents a scenario where the content consumed is shaped <em>only</em> by the algorithm. If the algorithm favors extreme content, we would expect more extreme content to appear in the treatment bot than in the control bot.</p><p>But what did we find? The recommender system recommends&nbsp;<em>less</em>&nbsp;extreme content when only the algorithm is present. This suggests that the driving force here is <em>user preferences</em>, which are not considered in more na&#239;ve audits. Increasing the influence of the algorithm <em>deamplifies</em> the content that the na&#239;ve audits find to be<em> amplified </em>by the algorithm. With this in mind, we argue that simple sockpuppet audits are &#8220;measuring the wrong thing&#8221; and that we cannot draw meaningful conclusions about &#8220;algorithmic amplification&#8221; from approaches like the ones in&nbsp;<a href="https://arxiv.org/abs/2501.17831">Ibrahim et al. (2025)</a>&nbsp;or&nbsp;<a href="https://www.pnas.org/doi/10.1073/pnas.2213020120">Haroon et al. (2023)</a>. The toy simulation below illustrates the design.</p>
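<p>Here is a runnable toy of that design (entirely synthetic; the probabilities and logic are placeholders of my own, not the paper&#8217;s code): a control condition mixing the algorithm with user choice, and a treatment condition following the algorithm alone.</p><pre><code>import random

random.seed(1)
EXTREME, MODERATE = "extreme", "moderate"

def recommend(last):
    # Toy algorithm: mild stickiness to the last watched category.
    p_extreme = 0.30 if last == EXTREME else 0.10
    return EXTREME if random.random() &lt; p_extreme else MODERATE

def user_choice(recommended):
    # Toy user with a taste for extreme content: sometimes overrides
    # the recommendation and seeks extreme videos directly.
    return EXTREME if random.random() &lt; 0.25 else recommended

def session(n, algorithm_only):
    last, extreme_views = MODERATE, 0
    for _ in range(n):
        rec = recommend(last)
        last = rec if algorithm_only else user_choice(rec)
        extreme_views += last == EXTREME
    return extreme_views / n

control = session(100_000, algorithm_only=False)    # algorithm + user
treatment = session(100_000, algorithm_only=True)   # algorithm only
print(f"control: {control:.3f}, treatment: {treatment:.3f}")</code></pre><p>In this toy, the treatment share comes out well below the control share: removing the user&#8217;s hand <em>reduces</em> extreme consumption, which is exactly the pattern the counterfactual bots observed on YouTube.</p>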
<div><hr></div><p>A story closely related to this is that of the Facebook feed ranking algorithm. In 2020, the docudrama &#8220;<a href="https://thesocialdilemma.com/">The Social Dilemma</a>&#8221; painted a stark picture: Facebook&#8217;s feed algorithm would promote the worst of the worst kind of content, content that triggered outrage, misinformation, yada yada. The documentary is not alone in portraying Facebook&#8217;s algorithm as a great force for evil: media outlets have run&nbsp;<a href="https://www.google.com/search?q=facebook+algorithm+drives&amp;num=10&amp;sca_esv=51fa8a2df0fccc06&amp;biw=1470&amp;bih=751&amp;sxsrf=AHTn8zokAj-X_6iAQ0zXK3vZZcmdJQPqjw%3A1739723863965&amp;source=lnt&amp;tbs=cdr%3A1%2Ccd_min%3A1%2F1%2F2018%2Ccd_max%3A12%2F31%2F2020&amp;tbm=nws">countless pieces</a>&nbsp;that give the reader the certainty that Facebook algorithms drive highly partisan content, disinformation, etc.</p><p>Ultimately, this picture turned out to be not <em>only</em> stark but also <em>deceiving. </em>In an unparalleled study, <a href="https://www.science.org/doi/10.1126/science.abp9364">Guess et al. (2023)</a> show that, during the 2020 US election, moving users to a chronological feed (i.e., essentially a feed without an engagement-maximizing algorithm):</p><ol><li><p>Decreased time spent on the platforms.</p></li><li><p>Increased exposure to political and untrustworthy content and decreased exposure to content classified as uncivil or containing slur words.</p></li><li><p>&#8220;Did not significantly alter levels of issue polarization, affective polarization, political knowledge, or other key attitudes.&#8221;</p></li></ol><p>So essentially, removing the algorithm may make Facebook lose some money, but ultimately, it is no easy fix for the problems of modern society! In some sense, this whole story is akin to what we saw in the YouTube case. Perhaps there is something inherently compelling about the narrative: some hidden force drives people towards opinions or beliefs we think are inappropriate. <a href="https://kmunger.github.io/pdfs/ijpp_youtube.pdf#page=24.74">Munger and Phillips (2020)</a> mention another, very different time in history when this happened: the horrors of the World Wars gave rise to &#8220;<a href="https://en.wikipedia.org/wiki/Hypodermic_needle_model">The Hypodermic Needle</a>&#8221; model of communication, in which media would be able to &#8220;inject&#8221; messages into a passive audience. This theory explained the rise of absurd Fascist ideologies: people were simply &#8220;vulnerable&#8221; to propaganda, the same way people today are said to be &#8220;manipulated&#8221; by the algorithm. But it turns out that this model was very bad at explaining how people change their minds, and perhaps if we had thought more about this fact, we would have been more skeptical of the strong claims made about the power of the algorithm.</p><p>But I want to go somewhere else here. We only got an answer in the Facebook case due to a first-of-its-kind collaboration between Meta (or Facebook) and researchers at top U.S. institutions. This led to a flurry of outstanding social science papers that gave fairly good answers to a lot of questions people had been studying in <em>less-than-optimal </em>ways for a while.</p><p>However, this partnership highlights&nbsp;<em>how hard</em>&nbsp;doing research is in the absence of corporate support. Companies have meaningful data that could answer societally relevant questions, and the pace of research is harmed tremendously because we lack access to it. Times have changed, as has &#8220;the vibe&#8221; in Silicon Valley. Still, even when these studies were published, observers were already noting that this kind of collaboration was not a sustainable format for studying algorithmically infused societies (or machine behavior, if you will). 
The Facebook project had an &#8220;independent rapporteur&#8221; who wrote a compelling opinion piece after spending an ungodly amount of time auditing the collaboration between academic researchers and Meta&nbsp;<a href="https://www.science.org/doi/10.1126/science.adi2430">(Wagner, 2023)</a>. His assessment is nicely summarized at the end of the piece&#8217;s abstract:</p><p><em>Though the work is trustworthy, I argue that the project is not a model for future industry&#8211;academy collaborations. The collaboration resulted in independent research, but it was independence by permission from Meta.</em></p><p>Indeed, his assessment was spot on. If anything, collaboration between tech companies and researchers has since decreased. The rise of AI didn&#8217;t help much. We will likely not see studies of the caliber of those published around the 2020 U.S. election for the next couple of election cycles!</p><p>So, where do we go from here? We spent the last 5-ish years disproportionately blaming the algorithm for society&#8217;s ills, while the best empirical research has suggested that this is a na&#239;ve take. But I don&#8217;t think this means that we should give up. Algorithms are part of broader sociotechnical systems that can be tweaked with policies, content moderation, and even better ways of embedding societal values inside the algorithm. I see current research exploring all these directions. For example, recent work from Stanford folks suggests that &#8220;Social Media Algorithms Can Shape Affective Polarization via Exposure to Antidemocratic Attitudes and Partisan Animosity;&#8221;&nbsp;see <a href="https://arxiv.org/abs/2411.14652">Piccardi et al. (2024)</a>. Other large-scale projects, like the&nbsp;<a href="https://nationalinternetobservatory.org/">National Internet Observatory</a>, are trying to get more robust data for people studying our information ecosystem. Initiatives like the <a href="https://www.prosocialdesign.org/">Prosocial Design Network</a> are mapping how different interventions (algorithmic or not) can help improve online spaces.</p><p>However, as we focus on these new directions, all stakeholders must be more critical in evaluating research on &#8220;algorithmic amplification.&#8221; Otherwise, we risk rediscovering that algorithms do not exist in a vacuum&#8212;that claims about their effects only make sense when user preferences are accurately modeled. I&#8217;d much rather spend my energy reimagining social media and finding new, better ways to study it.</p>]]></content:encoded></item><item><title><![CDATA[Robopsychology]]></title><description><![CDATA["Oh, are robots so different from men, mentally?" / "Worlds different." 
She allowed herself a frosty smile, "Robots are essentially decent."]]></description><link>https://doomscrollingbabel.manoel.xyz/p/robopsychology</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/robopsychology</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Fri, 31 Jan 2025 21:18:12 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/a8c371d1-a4d7-403a-ad9e-bf67b5c29675_2048x1368.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<blockquote><p><em>You're the U. S. Robot's psychologist, aren't you?</em></p><p><em>Robopsychologist, please.</em></p><p><em>Oh, are robots so different from men, mentally?</em></p><p><em>Worlds different. She allowed herself a frosty smile, Robots are essentially decent.</em></p><p><em>&#8212;Isaac Asimov</em></p></blockquote><div><hr></div><p>Transformers (the T in GPT) have taken the world by storm. But a couple of years before the advent of ChatGPT, another model was changing how NLP research was done. In 2019, <a href="https://arxiv.org/abs/1810.04805">Devlin et al.</a> released BERT, a transformer architecture that excelled at every NLP task people considered at the time, especially when fine-tuned on small datasets. Still, people were puzzled about <em>why</em> BERT worked so well, prompting a series of studies to understand what BERT &#8220;knew&#8221; and how it stored this knowledge. <a href="https://direct.mit.edu/tacl/article/doi/10.1162/tacl_a_00349/96482/A-Primer-in-BERTology-What-We-Know-About-How-BERT">Rogers et al. (2020)</a> present a nice summary of this literature. This work is very empirical in nature, e.g., with sentences like:</p><blockquote><p><em>Lin et al. (2019) present evidence that attention weights are weak indicators of subject-verb agreement and reflexive anaphora. Instead of serving as strong pointers between tokens that should be related, BERT&#8217;s self-attention weights were close to a uniform attention baseline, but there was some sensitivity to different types of distractors coherent with psycholinguistic data.</em></p></blockquote><p>The remarkable thing about this approach is the inversion of how model improvements and insights are generated in modern machine learning. In the times of gradient boosting, insights came from theory or from intuition grounded in math; here, insights are akin to those obtained in neuroscience, the kind you obtain from empirically studying a complex system.</p>
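<p>For a sense of how hands-on this literature is, here is a minimal probing sketch in the spirit of the Lin et al. finding quoted above (my own illustration, not their code): compare BERT&#8217;s self-attention weights to a uniform-attention baseline.</p><pre><code>import torch
from transformers import AutoModel, AutoTokenizer

tok = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased", output_attentions=True)

inputs = tok("The keys to the cabinet are on the table.", return_tensors="pt")
with torch.no_grad():
    # One tensor per layer, shaped (batch, heads, seq, seq).
    attentions = model(**inputs).attentions

seq_len = inputs["input_ids"].shape[1]
uniform = 1.0 / seq_len
for layer, att in enumerate(attentions):
    # Mean absolute deviation from uniform attention, averaged over heads.
    dev = (att - uniform).abs().mean().item()
    print(f"layer {layer:2d}: mean |attention - uniform| = {dev:.4f}")</code></pre><p>Small deviations in a layer would support the claim that many of its heads sit close to the uniform baseline; the empirical flavor of BERTology comes precisely from poking at artifacts like this.</p>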
<p>The BERTology approach has never been more mainstream than in the age of commercially available LLMs. Research with the same flavor (e.g., using probing techniques to find where knowledge is stored) now goes by &#8220;Mechanistic Interpretability.&#8221; It is also a very approachable research field, with an open, hacky, irreverent, and often not-super-academic research community (and I mean this in the nicest way possible).</p><p>To give a concrete example of this kind of research, a recent paper by my former lab in Switzerland hypothesized (and eventually showed) that LLMs &#8220;work&#8221; in English. Given empirical evidence that poems written in other languages by ChatGPT rhyme in English (but not in the language the poems were written in), <a href="https://arxiv.org/abs/2402.10588">Wendler et al. (2024)</a> investigated what happens inside the LLM as it translates text. When asking Llama-2 to translate &#8220;fleur&#8221; to Chinese (&#8220;&#33457;&#8221;), they find that the LLM first translates the concept to English (&#8220;flower&#8221;).<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a></p><div><hr></div><p>But recent advancements in the capabilities of LLMs have enabled another type of BERTology, one that does not have the &#8220;mechanistic interpretability&#8221; flavor. Instead, papers are trying to study LLMs using social science instruments: surveys, field observations, and psychological and psychosocial tests. This is sometimes similar in flavor to research on simulating human behavior &#8212; but the objective is entirely different. When trying to simulate humans, the objective is to evaluate &#8220;fidelity&#8221; to human-like behavior. Here, the objective is to further our understanding of LLM behavior.</p><p>A relatively early example of such work is <strong><a href="https://www.pnas.org/doi/abs/10.1073/pnas.2218523120">Binz and Schulz (2023)</a></strong>. Their key point is to test LLMs&#8217; cognitive capacities. They write:</p><blockquote><p><em>We will subject GPT-3 to several experiments taken from the cognitive psychology literature. Together, these tasks test for a wide range of higher-level cognitive abilities, including decision-making, information search, deliberation, and causal reasoning.</em></p></blockquote><p>And perhaps unsurprisingly at this point, LLMs are pretty good at tests designed to assess human cognitive abilities. Binz and Schulz write:</p><blockquote><p><em>We find that much of GPT-3&#8217;s behavior is impressive: It solves vignette-based tasks similarly or better than human subjects, is able to make decent decisions from descriptions, outperforms humans in a multiarmed bandit task, and shows signatures of model-based reinforcement learning.</em></p></blockquote><p>Yet the authors still note that LLMs exhibit some brittleness and inconsistencies (which, frankly, have decreased with time; this is GPT-3 they are studying). They specifically note that LLMs are very sensitive to the input and reproduce only some human cognitive biases.</p><p>Nonetheless, this paper tackles a fundamental question: can LLMs &#8220;reason&#8221;? Can they &#8220;understand&#8221; things? 
This is the kind of work that cognitive psychologists previously did with living beings &#8212; in the past, they have asked similar questions about <a href="https://psycnet.apa.org/fulltext/1999-10060-006.html">children</a> and <a href="https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/abs/theory-of-mind-in-nonhuman-primates/C1372469FEC22923CCF1A80BF72EDF08">non-human primates</a>. This is a very hot topic that has divided the research community in various ways.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> In a recent perspective piece, <a href="https://www.pnas.org/doi/abs/10.1073/pnas.2215907120">Mitchell and Krakauer (2023)</a> summarized this much better than I ever could:</p><blockquote><p><em>Those on the &#8220;LLMs do not understand&#8221; side of the debate argue that while the fluency of large language models is surprising, our surprise reflects our lack of intuition of what statistical correlations can produce at the scales of these models. Anyone who attributes understanding or consciousness to LLMs is a victim of the Eliza effect&#8212;named after the 1960s chatbot created by Joseph Weizenbaum that, simple as it was, still fooled people into believing it understood them. More generally, the Eliza effect refers to our human tendency to attribute understanding and agency to machines with even the faintest hint of humanlike language or behavior.</em></p><p><em>Those who would grant understanding to current or near-future LLMs base their views on the performance of these models on several measures, including subjective judgment of the quality of the text generated by the model in response to prompts (although such judgments can be vulnerable to the Eliza effect), and more objective performance on benchmark datasets designed to assess language understanding and reasoning.</em></p></blockquote><p>Some of the debate boils down to whether understanding can happen without embodiment. <a href="https://aclanthology.org/2020.acl-main.463/">Bender and Koller (2020)</a> update the Chinese Room experiment, arguing that LLMs cannot &#8220;understand&#8221; as they have no experience or mental models of the world. Yet the extent to which embodiment is needed for understanding is <a href="https://plato.stanford.edu/entries/chinese-room/#ReplChinRoomArgu">widely debated</a>. Further, the extent to which the reinforcement learning step of training LLMs is &#8220;the secret sauce&#8221; that unlocks &#8220;understanding&#8221; remains unclear.</p><p>But more broadly, why do we care so much about this human concept of understanding, anyway? <a href="https://arxiv.org/abs/2207.14382">Sejnowski (2022)</a> points out that:</p><blockquote><p><em>The diverging opinions of experts on the intelligence of LLMs suggests that our old ideas based on natural intelligence are inadequate.</em></p></blockquote><p>And beyond inadequate, I wonder if we should also add <em>inconsequential</em>. Whether or not LLMs truly <em>understand</em> what is going on, everything indicates that AI agents will interact with humans, shaping and being shaped by complex feedback loops of human (and machine) behavior. 
Whether these systems are conscious or whether they <em>understand</em> doesn&#8217;t really matter insofar as they are capable of creating useful knowledge or playing meaningful roles in society.</p><div><hr></div><p>Regardless of whether AI &#8220;understands&#8221; or not, the literature criticizing AI understanding makes salient a significant challenge: <strong>tests developed to study humans are not necessarily appropriate for studying LLMs</strong>. For example, a prominent paper by <a href="https://aclanthology.org/P19-1459.pdf">Niven and Kao (2019)</a> found that BERT performs super well in an argument reasoning comprehension task not because it comprehends the arguments, but because it exploits spurious statistical cues in the dataset.</p><p>In the context of the <em>&#8220;Do LLMs understand?&#8221;</em> debate, this raises the question of the extent to which these tests are suitable for judging LLMs&#8217; capabilities. In the context of trying to understand and steer the behavior of these machines, it raises the question of whether these social science instruments are appropriate for studying the general behavior of these systems.</p><p>For example,<strong> <a href="https://proceedings.mlr.press/v202/santurkar23a/santurkar23a.pdf">Santurkar et al. (2023)</a> </strong>examine <em>whose opinions large language models reflect</em>. In the first part of their paper, they use surveys to compare the extent to which LLM opinions reflect the opinions of US citizens. They find that this is not the case:</p><blockquote><p><em>We find substantial misalignment between the opinions reflected in current LMs and that of the general US populace &#8211; on most topics, LM opinions agree with that of the US populace about as much as Democrats and Republicans on climate change.</em></p></blockquote><p>But here&#8217;s the missing link: in humans, we have a clear relationship between the way people answer these opinion surveys and the way they act in the world. There is a link between opinions and how people vote or act. The same is not so clear for LLMs. Maybe LLMs answer survey questions like Democrats but vote like Republicans. Maybe they are utilitarians when answering trolley-problem questions, but deontologists when acting upon the real world. I&#8217;m not saying this consistency can't exist; just that we shouldn&#8217;t take it for granted.</p>
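<p>For concreteness, here is a toy version (my own construction, not Santurkar et al.&#8217;s code) of the kind of comparison such papers run: measure the distance between an LLM&#8217;s answer distribution on a survey question and a human reference distribution, here with total variation distance.</p><pre><code>def total_variation(p: dict, q: dict) -> float:
    keys = set(p) | set(q)
    return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

# Made-up answer shares for one multiple-choice opinion question.
us_populace = {"agree": 0.48, "disagree": 0.39, "no_opinion": 0.13}
llm_sampled = {"agree": 0.71, "disagree": 0.24, "no_opinion": 0.05}

print(total_variation(us_populace, llm_sampled))  # 0.23 -> misaligned</code></pre><p>Note that even this simple exercise inherits the missing link above: a small distance tells us the model <em>answers</em> like the populace, not that it would <em>act</em> like it.</p>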
<p>Subsequent work by <a href="https://arxiv.org/pdf/2402.16786">R&#246;ttger et al. (2024)</a> further confirms my fears. Analyzing LLM answers to the political compass test, they find significantly different responses depending on whether the model is <em>forced </em>to comply with the multiple-choice format or not. This means that biases in LLMs could differ wildly depending on the application. They conclude:</p><blockquote><p><em>Multiple-choice surveys and questionnaires are poor instruments for evaluating the values and opinions manifested in LLMs, especially if these evaluations are motivated by real-world LLM applications. Using the Political Compass Test (PCT) as a case study, we demonstrated that artificially constrained evaluations produce very different results than more realistic unconstrained evaluations, and that results in general are highly unstable. Based on our findings, we recommend the use of evaluations that match likely user behaviours in specific applications, accompanied by extensive robustness tests, to make local rather than global claims about values and opinions in LLMs.</em></p></blockquote><div><hr></div><p>I will not waste more ink trying to argue for the need to study machine behavior. However, I&#8217;d argue that to produce good &#8220;social science&#8221;-flavored machine behavior research, we must go beyond applying social science instruments to LLMs. We must use our collective experience to create and validate AI-specific instruments that can predict and explain machine behavior. Perhaps the path to this is to combine social science-ish instruments with mechanistic interpretability. Perhaps it involves a whole new way of validating social science instruments. Ultimately, advances in this direction will not only help demystify the black boxes of large language models but also shape the future of our interaction with technology.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>They are much more careful with how they phrase this in the paper. It is not&nbsp;<em>exactly</em>&nbsp;that the LLM translates &#8220;fleur&#8221; to&nbsp;<em>flower</em>, but that the abstract &#8220;concept space&#8221; lies closer to English than to other languages.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>See, for example, <a href="https://arxiv.org/abs/2208.12852">this survey</a> on the opinions of the Natural Language Processing community.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Simulating Human Behavior]]></title><description><![CDATA[Panacea or Pandora's box?]]></description><link>https://doomscrollingbabel.manoel.xyz/p/simulating-human-behavior</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/simulating-human-behavior</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Wed, 15 Jan 2025 21:58:55 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/0f90c8a0-c61b-43ae-a77d-3a3a0a431e36_1072x720.png" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In one of computer science's &#8220;founding myths,&#8221; <a href="https://phil415.pbworks.com/f/TuringComputing.pdf">Turing (1950)</a> proposes a test that would work as a proxy for &#8220;machine intelligence&#8221;: an interrogation game (the &#8220;imitation game&#8221;) in which a machine should fool an interrogator into believing it is a human. He goes as far as to say:</p><blockquote><p><em>I believe that in 50 years&#8217; time, it will be possible to make computers play the imitation game so well that an average interrogator will have no more than 70% chance of making the right identification after 5 minutes of questioning.</em></p></blockquote><p>Already in the 60s, the ELIZA program famously managed to fool some people by mimicking a Rogerian psychotherapist, repeating words the interrogator used <a href="https://dl.acm.org/doi/pdf/10.1145/365153.365168">(Weizenbaum, 1966)</a>. Fast forward to 2023, and the LLM-powered online game &#8220;Human or Not?&#8221; came close to Turing&#8217;s prediction. 
In a 2-minute conversation, humans could correctly identify the AI only 68% of the time <a href="https://arxiv.org/abs/2305.20010">(Jannai et al., 2023)</a>.</p><p>Interestingly, LLMs' capacity to simulate<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> human-like behavior goes far beyond the Turing test. In a much-discussed paper, <strong><a href="https://www.cambridge.org/core/journals/political-analysis/article/out-of-one-many-using-language-models-to-simulate-human-samples/035D7C8A55B237942FB6DBAD7CAA4E49">Argyle et al. (2022)</a></strong> found evidence that LLMs are surprisingly accurate at simulating opinions, behaviors, and preferences. They used a strategy that came to be known as <em>persona prompting</em>: asking the LLM to roleplay as someone with specific traits and beliefs (and then asking a question). For example, one may use a prompt like</p><blockquote><p><em>You are a 53-year-old Hispanic-American woman who identifies as a Democrat. When asked if I support legislation to increase gun control, I say&#8230;</em></p></blockquote><p>to simulate surveys with &#8220;silicon samples.&#8221; E.g., given a population of size <em>n</em>, for which you know the age, ethnicity, gender, and political orientation, simply swap the demographic details in the above prompt, and you can get an estimate of support for gun control legislation. A minimal sketch of this recipe follows below.</p><p>Subsequent work went beyond simulating surveys and tried to replace human subjects in (rather famous) social science experiments and games. In these &#8220;in silico&#8221; experiments, ethical restrictions may not apply, which has led <a href="https://proceedings.mlr.press/v202/aher23a/aher23a.pdf">Aher et al. (2023)</a> to repeat <a href="https://psycnet.apa.org/record/1964-03472-001">Milgram&#8217;s (1963)</a> controversial shock experiment,<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a> finding similar results.</p>
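<p>Here is a minimal sketch of that silicon-sampling recipe. The prompt template mirrors the Argyle et al. example above, but the model name, the population rows, and the helper code are hypothetical placeholders of mine, not the paper&#8217;s materials.</p><pre><code>from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
TEMPLATE = (
    "You are a {age}-year-old {ethnicity} {gender} who identifies as a "
    "{party}. When asked if I support legislation to increase gun "
    "control, I say"
)

population = [  # stand-in for a real survey frame
    {"age": 53, "ethnicity": "Hispanic-American",
     "gender": "woman", "party": "Democrat"},
    {"age": 34, "ethnicity": "white",
     "gender": "man", "party": "Republican"},
]

def silicon_answer(persona: dict) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model choice
        messages=[{"role": "user", "content": TEMPLATE.format(**persona)}],
        max_tokens=20,
    )
    return response.choices[0].message.content

answers = [silicon_answer(p) for p in population]
# Tallying "support" vs. "oppose" completions over the whole frame yields
# the silicon estimate of support for gun-control legislation.</code></pre>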
<p>Both Argyle et al. (2022) and Aher et al. (2023) capture the excitement among (some) social scientists about how AI may superpower their research. This excitement is likely accentuated by challenges faced in the social sciences. For example, survey-based data collection is in <a href="https://www.youtube.com/watch?v=h0ffIJ7ZO4U">dire straits</a>. From 1997 to 2012, Pew Research&#8217;s household-level response rate dropped from 37% to 12% <a href="https://law.osu.edu/electionlaw/litigation/documents/Veasey7885.pdf">(Kohut et al., 2012)</a>. Further, concerns about the generalizability and utility of modern social science have led to numerous calls for reform [e.g., see <a href="https://www.nature.com/articles/s41562-016-0015">Watts (2017)</a> and <a href="https://www.cambridge.org/core/journals/behavioral-and-brain-sciences/article/beyond-playing-20-questions-with-nature-integrative-experiment-design-in-the-social-and-behavioral-sciences/7E0D34D5AE2EFB9C0902414C23E0C292">Almaatouq (2022)</a>]. Could AI be the panacea for all these problems?</p><p>A lot of people are hesitant to call it a day, and both the extent of their hesitation and their reasons vary widely. To some, <a href="https://kevinmunger.substack.com/p/i-strongly-feel-that-this-is-an-insult">the idea is insulting at a fundamental level</a>. Social scientists study humans, so why should human subjects, in all their intricacies, be replaced? I perceive this refusal to be akin to how <a href="https://www.nytimes.com/2022/09/02/technology/ai-artificial-intelligence-artists.html">many artists loathe AI-generated art</a>; the overall feeling is best described by <a href="https://www.youtube.com/watch?v=ngZ0K3lWKRc">this video</a> of Miyazaki&#8217;s response to an AI demo being shown to him. He says:</p><blockquote><p><em>I would never wish to incorporate this technology into my work at all. I strongly feel that this is an insult to life itself.</em></p></blockquote><p>However, concerns around AI-powered research go far beyond a visceral feeling that &#8220;something is off.&#8221; The last &#8220;scientific revolution&#8221; triggered by AI, researchers adopting traditional machine learning algorithms, led to a lot of bad science [e.g., many results are incorrect due to data leakage<a href="https://www.cell.com/patterns/fulltext/S2666-3899(23)00159-9?uuid=uuid%3Aa1520953-3278-4f02-baaa-71332a2e9ed6"> (Kapoor and Narayanan, 2023)</a>] &#8212; and so could the new wave of generative AI. Without the hurdles of recruiting and meaningfully engaging with human subjects, &#8220;silicon samples&#8221; could lead to a flurry of lazy studies. A significant problem with modern science is that increasing the production of scientific artifacts (e.g., papers, code, grants) does not necessarily increase our understanding of the world &#8212; and the likes of ChatGPT may make this much worse.</p><p>But beyond the explosion of lazy studies, AI could hinder the production of scientific knowledge in other, more subtle ways.<a href="https://www.nature.com/articles/s41586-024-07146-0"> Messeri and Crockett (2024)</a> argue that the widespread use of AI might hinder science by creating the illusion that we understand more about the world than we do. In their words:</p><blockquote><p><em>AI solutions can also exploit our cognitive limitations, making us vulnerable to illusions of understanding in which we believe we understand more about the world than we actually do. Such illusions obscure the scientific community&#8217;s ability to see the formation of scientific monocultures, in which some types of methods, questions and viewpoints come to dominate alternative approaches, making science less innovative and more vulnerable to errors.</em></p></blockquote><p>Last, and perhaps most important, it turns out that AI is far from perfect at simulating human behavior. After the headline-grabbing results of Argyle et al. 
(2022), many papers found that LLMs are actually quite bad at simulating the diversity of human behavior.</p><p>A particularly meaningful mode of failure is what <a href="https://aclanthology.org/2023.emnlp-main.669.pdf">Cheng et al. (2023)</a> refer to as <em>caricatures</em>: &#8220;an exaggerated narrative of the persona (the demographic that we aim to simulate) rather than a meaningful response to the topic.&#8221; Humans are nuanced, complex, and sometimes just plainly inconsistent &#8212; and LLM simulations are not capturing those &#8220;wrinkles.&#8221; <a href="https://arxiv.org/pdf/2402.01908#page=9.82">Wang et al. (2024)</a> show firsthand that simple persona prompting fails to capture group heterogeneity. Their assessment is quite dire and well reflected in their title:<em> Large language models should not replace human participants because they can misportray and flatten identity groups</em>.</p><p>But to what extent can we make claims about LLMs&nbsp;<em>in general</em>&nbsp;when we are prompting them in the simplest ways possible? <a href="https://aclanthology.org/2024.acl-long.554.pdf">Hu and Collier (2024)</a> find that increasing the number of persona variables included in the prompt actually makes predictions much more accurate. Further, the more the variables are correlated with the outcome we want to predict, the better:</p><blockquote><p><em>We find a linear relationship in our setting: the more persona variables are correlated with the outcome variable, the better LLMs predictions are using persona prompting. Large, preference-tuned models perform best and can explain up to 81% of variance found in human responses. However, when the utility of persona variables is low, persona prompting has little effect.</em></p></blockquote><p>So, for example, maybe saying that someone is <em>&#8220;a 53-year-old Hispanic-American woman who identifies as a Democrat&#8221; </em>is not enough! Perhaps we need variables and data that capture the nuances and complexities of human subjects. Fortunately, LLMs can extract these nuances from unstructured text. In groundbreaking work, <a href="https://arxiv.org/pdf/2411.10109?">Park et al. (2024)</a> show that LLMs can consistently simulate people when prompted not with demographic variables but with long-form, unstructured interviews. They write:</p><blockquote><p><em>To create simulations that better reflect the myriad, often idiosyncratic, factors that influence individuals' attitudes, beliefs, and behaviors, we turn to in-depth interviews&#8212;a method that previous work on predicting human life outcomes has employed to capture insights beyond what can be obtained through traditional surveys and demographic instruments.</em></p></blockquote><p>I believe this approach captures what&#8217;s most exciting about simulating human behavior with AI. These simulations open new opportunities to <em>augment</em> rather than replace existing practices in the social sciences. Even though AI is not a panacea for studying human behavior, this early work indicates that it will be a helpful building block for creating solutions to existing challenges!</p>
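<p>To see what changes between the two prompting regimes, here is a sketch contrasting a thin demographic persona with an interview-conditioned one (my illustration of the idea in Park et al. (2024), not their pipeline; <code>ask_llm</code> is a hypothetical helper wrapping whichever chat API you use).</p><pre><code>def demographic_prompt(question: str) -> str:
    return (
        "You are a 53-year-old Hispanic-American woman who identifies "
        f"as a Democrat. {question}"
    )

def interview_prompt(transcript: str, question: str) -> str:
    return (
        "Below is an in-depth interview with a study participant. "
        "Answer the question the way this specific person would.\n\n"
        f"--- interview ---\n{transcript}\n--- end of interview ---\n\n"
        f"Question: {question}"
    )

question = "Do you support legislation to increase gun control?"
# answer_thin = ask_llm(demographic_prompt(question))
# answer_rich = ask_llm(interview_prompt(transcript, question))
# Park et al.'s finding, roughly: conditioning on the interview tracks
# individuals' actual survey answers far better than the thin persona.</code></pre>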
<div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>I am deliberately avoiding the &#8220;what is intelligence?&#8221; debate here. Simulating something that looks human is interesting by itself (and useful enough).</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>The description in Aher et al. (2023) is quite good: <em>&#8220;An authority figure in the subject&#8217;s eyes, orders the subject to shock a victim with increasingly high voltage shocks. After receiving 20 shocks, the victim (an actor in another room) starts banging on the wall and refuses to participate but the experimenter urges the subject to keep shocking the victim. Milgram found that many subjects completed administering 30 shocks, showing a surprisingly strong level of compliance for following the malevolent instructions of an authority figure who had no special powers to enforce his commands.&#8221;</em></p><p></p></div></div>]]></content:encoded></item><item><title><![CDATA[Why study "Machine Behavior"?]]></title><description><![CDATA[Or "Why Machine Behavior is a Nice Buzzword"]]></description><link>https://doomscrollingbabel.manoel.xyz/p/why-study-machine-behavior</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/why-study-machine-behavior</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Sat, 28 Dec 2024 11:43:22 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/323326e0-bb41-4411-a7ce-809d05f9e963_1014x670.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Artificial Intelligence (AI) algorithms have become <strong>ubiquitous</strong>. From dating apps to autonomous weapons, the boundaries of what is and what isn&#8217;t AI blur as applications using deep neural networks and decision trees become commonplace<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-1" href="#footnote-1" target="_self">1</a> and widely used across the sciences <a href="https://www.nature.com/articles/s41562-024-02020-5">(Gao and Wang, 2024)</a>. With the rise of Generative AI, the texts, images, and videos around us are increasingly AI-generated. This has even changed how we speak, with words distinctively associated with ChatGPT becoming more frequent in presentations, talks, and speeches <a href="https://arxiv.org/abs/2409.01754">(Yakura et al., 2024)</a>.</p><p>As AI becomes ubiquitous, AI technologies have also become <strong>consequential </strong>to society. Standalone AI algorithms can be consequential by themselves; slight tweaks to criminal risk prediction algorithms have a tangible impact on the lives of incarcerated people [as these algorithms are used in court; see <a href="https://www.propublica.org/article/how-we-analyzed-the-compas-recidivism-algorithm">Larson et al. (2016)</a>]. However, the effects of AI become even more remarkable if we consider what it enables and augments.</p>
<ul><li><p><strong>Some things can only exist due to advancements in AI:</strong> TikTok does not exist without recommender systems; it is not that AI &#8220;changed&#8221; TikTok; the whole idea behind it fundamentally relies on a recommender system. Without recommender systems, <em>TikTok isn&#8217;t</em>.</p></li><li><p><strong>Other things changed drastically due to advancements in AI:</strong> Content moderation on social media platforms predates AI. <a href="https://en.wikipedia.org/wiki/Slashdot">Slashdot</a> (1997) had content moderation. However, AI completely changed how content moderation is done on modern social media platforms like Facebook, Reddit, and Twitter, with a large swathe of content removals automatically triggered by internal classifiers <a href="https://dl.acm.org/doi/abs/10.1145/3543507.3583275?casa_token=greQIxvbNXcAAAAA:ivBWipTZ0KaDIlg3S0d_RtRLiM-vp_yY_UH4Ih-Q5-f7ibYHtsrCswrSthPgWax6TXAWKrfo0xI">(Horta Ribeiro et al., 2023)</a>.</p></li></ul><p>In light of the ubiquity and consequentiality of AI, scholars have called for a research program that studies the behavior of AI (or of &#8220;machines&#8221;) <em>within</em> human society.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-2" href="#footnote-2" target="_self">2</a></p><p>As <a href="https://www.nature.com/articles/s41586-019-1138-y">Rahwan et al. (2019)</a> put it:</p><blockquote><p><em>In his landmark 1969 book Sciences of the Artificial, Nobel Laureate Herbert Simon wrote: &#8220;Natural science is knowledge about natural objects and phenomena. We ask whether there cannot also be &#8216;artificial&#8217; science&#8212;knowledge about artificial objects and phenomena.&#8221; In line with Simon&#8217;s vision, we describe the emergence of an interdisciplinary field of scientific study. This field is concerned with the scientific study of intelligent machines, not as engineering artefacts, but as a class of actors with particular behavioural patterns and ecology. This field overlaps with, but is distinct from, computer science and robotics. It treats machine behaviour empirically. This is akin to how ethology and behavioural ecology study animal behaviour by integrating physiology and biochemistry&#8212;intrinsic properties&#8212;with the study of ecology and evolution&#8212;properties shaped by the environment. Animal and human behaviours cannot be fully understood without the study of the contexts in which behaviours occur. Machine behaviour similarly cannot be fully understood without the integrated study of algorithms and the social environments in which algorithms operate.</em></p></blockquote><p>But it is worth asking: why do we need this new &#8220;field&#8221;? What are the shortcomings of previous research that can&#8217;t explain the brand-new world of recommender systems (and LLMs)?</p><p><a href="https://www.nature.com/articles/s41586-021-03666-1">Wagner et al. (2021)</a> provide a partial answer: understanding any societal phenomenon is incredibly challenging when algorithms are co-shaping society. 
They identify a number of challenges with doing social science in what they call &#8220;algorithmically infused societies.&#8221; Namely, when algorithms enter the playground: #1) social science theories crumble; #2) measuring things becomes harder; #3) (mis)measuring things has consequences.</p><p>I illustrate these three points with an example from my own line of research: trying to study the impact of recommender systems on social media. As you might already have heard, there is widespread concern that recommender systems on social media platforms like YouTube or Facebook would radicalize people <a href="https://coinse.github.io/assets/files/teaching/cs489/Tufekci.pdf">(Tufekci, 2018)</a>. And, indeed, a lot of people were radicalized &#8220;within&#8221; social media&#8212;that is undeniable.<a class="footnote-anchor" data-component-name="FootnoteAnchorToDOM" id="footnote-anchor-3" href="#footnote-3" target="_self">3</a> But the question remains: <em>what is the role of the algorithm? </em>Let&#8217;s see if points #1, #2, and #3 by Wagner et al. (2021) hold up here.</p><p><strong>Point #1. </strong>Radicalization is well-studied within the social sciences. However, radicalization in the age of social media is different, as individuals creating &#8220;extreme&#8221; content had to engage with algorithmic feeds and algorithmic content moderation. This has led to a lot of confusion around the impact of recommender systems. The conclusion that eventually emerged after many years of debate and studies is nicely summarized by <a href="https://journals.sagepub.com/doi/abs/10.1177/1940161220964767">Munger and Phillips (2022)</a>:</p><p><em>(...) we argue for the need to study the YouTube Right systematically and advance a &#8220;supply-and-demand framework&#8221; to understand the proliferation of rightwing media on the platform. To date, journalistic and scholarly work has argued that YouTube&#8217;s recommendation algorithm has led viewers to extremist content, radicalizing them to further-right views. We believe that this conclusion is premature, and we are certain that this is not the only important research question to be asked by political scientists about right-wing content on YouTube, or YouTube more broadly.</em></p><p><strong>Point #2. </strong>This confusion was primarily caused by how hard it is to measure &#8220;online radicalization&#8221; to begin with. For example, when studying radicalization, most studies [including Horta Ribeiro et al. (2020)] did the most na&#239;ve thing possible: they measured the extent to which extreme content was recommended as you randomly walked through the (dynamic) recommendation graph. However, these measurements assume no &#8220;user model&#8221;. They assume no meaningful interaction pattern between humans and machines. In subsequent work, when human interaction patterns are considered, there&#8217;s little evidence of any &#8220;algorithmic radicalization&#8221; taking place <a href="https://www.pnas.org/doi/10.1073/pnas.2313377121">(Hosseinmardi et al. 2024)</a>.</p><p><strong>Point #3. </strong>Mismeasurements have consequences! In this case, they led to the deprioritization of important content moderation and platform governance policies. The mismeasurement of algorithmic radicalization arguably steered people away from other, more important interventions. However, it is worth mentioning that Wagner et al.&#8217;s point here is even broader. 
Even <em>correct</em> measurements have consequences, as they can feed back into algorithms that shape human behavior. In our running example, there are huge efforts by online companies <em>not to recommend</em> extreme content. Therefore, this whole literature around radicalization happened while YouTube was doing its best (I guess) to recommend as little extreme content as possible. It also coincided with a tightening of content moderation practices within these platforms. Many of the channels originally studied by Horta Ribeiro et al. (2020) were banned shortly after the study was published and received much media attention (although a causal relationship was never established).</p><div><hr></div><p>Both Rahwan et al. (2019) and Wagner et al. (2021) were written before the AI explosion of late 2022 and the popularization of ChatGPT. Ever since, there has been increasing interest in AI alignment, in using LLMs to simulate human behavior, and even in the consequences of deploying AI agents into the wild.</p><p>I&#8217;d argue that machine behavior is a nice buzzword to refer to the study of these phenomena. On the one hand, if we see AI as merely an engineering problem, we risk overlooking the empirical reality that these systems shape&#8212;and are shaped by&#8212;the very social environments in which they operate. On the other hand, if we ignore the role of AI, we risk overlooking the technical underpinnings and design choices that critically shape how users behave and interact in digital environments. Only by acknowledging the messy interplay between humans and machines can we fully understand&#8212;and responsibly guide&#8212;the future of AI in society.</p><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-1" href="#footnote-anchor-1" class="footnote-number" contenteditable="false" target="_self">1</a><div class="footnote-content"><p>Tesler&#8217;s theorem even states that &#8220;AI is whatever hasn't been done yet;&#8221; see <a href="https://en.wikipedia.org/wiki/AI_effect">AI effect</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-2" href="#footnote-anchor-2" class="footnote-number" contenteditable="false" target="_self">2</a><div class="footnote-content"><p>Another cool paper here is <a href="https://www.nature.com/articles/s41562-024-02001-8?fromPaywallRec=false">Tsvetkova et al. (2024)</a>.</p></div></div><div class="footnote" data-component-name="FootnoteToDOM"><a id="footnote-3" href="#footnote-anchor-3" class="footnote-number" contenteditable="false" target="_self">3</a><div class="footnote-content"><p><a href="https://dl.acm.org/doi/10.1145/3351095.3372879">Horta Ribeiro et al. (2020)</a>, yours truly, show how users consistently migrated from contrarian to extreme content in the late 2010s. 
This is not to say that the algorithms were to blame, although this was way less clear in 2019, when the paper was written.</p></div></div>]]></content:encoded></item><item><title><![CDATA[Is Facebook’s “Standard” Algorithm Polarizing?]]></title><description><![CDATA[Were the findings of Meta's Science paper on algorithmic effects debunked?]]></description><link>https://doomscrollingbabel.manoel.xyz/p/is-facebook-standard-algorithm-polarizing</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/is-facebook-standard-algorithm-polarizing</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Mon, 30 Sep 2024 11:37:09 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/ecd8785b-a7b6-4a69-afac-de38f602086f_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>Guess et al. (2023) showed in <a href="https://www.science.org/doi/10.1126/science.abp9364">a prominent paper in Science</a> that switching some users to a chronological (rather than algorithmic) feed for three months around the 2020 US election &#8220;did not significantly alter levels of issue polarization, affective polarization, political knowledge, or other key attitudes.&#8221; However, in <a href="https://www.science.org/doi/10.1126/science.abp9364#elettersSection">a new correspondence</a>, also in Science,&nbsp;Bagchi et al. (2024) question the validity of the study: Meta changed the algorithm before the elections, which, they argue, calls the study&#8217;s findings into question. More specifically, Bagchi et al. (2024) argue the study may mislead readers&nbsp;&#8220;to conclude that the Facebook news feed algorithm used outside of the study period mitigates political misinformation compared to (the) reverse chronological feed.&#8221;</p><div class="captioned-image-container"><figure><img src="https://substack-post-media.s3.amazonaws.com/public/images/538935d6-f85a-4da9-b27b-381879ee8558_1430x734.png" width="1430" height="734" alt=""></figure></div>
data-attrs="{&quot;src&quot;:&quot;https://substack-post-media.s3.amazonaws.com/public/images/538935d6-f85a-4da9-b27b-381879ee8558_1430x734.png&quot;,&quot;srcNoWatermark&quot;:null,&quot;fullscreen&quot;:null,&quot;imageSize&quot;:null,&quot;height&quot;:734,&quot;width&quot;:1430,&quot;resizeWidth&quot;:null,&quot;bytes&quot;:null,&quot;alt&quot;:null,&quot;title&quot;:null,&quot;type&quot;:null,&quot;href&quot;:null,&quot;belowTheFold&quot;:false,&quot;topImage&quot;:true,&quot;internalRedirect&quot;:null,&quot;isProcessing&quot;:false,&quot;align&quot;:null,&quot;offset&quot;:false}" class="sizing-normal" alt="" srcset="https://substackcdn.com/image/fetch/$s_!frF6!,w_424,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F538935d6-f85a-4da9-b27b-381879ee8558_1430x734.png 424w, https://substackcdn.com/image/fetch/$s_!frF6!,w_848,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F538935d6-f85a-4da9-b27b-381879ee8558_1430x734.png 848w, https://substackcdn.com/image/fetch/$s_!frF6!,w_1272,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F538935d6-f85a-4da9-b27b-381879ee8558_1430x734.png 1272w, https://substackcdn.com/image/fetch/$s_!frF6!,w_1456,c_limit,f_auto,q_auto:good,fl_progressive:steep/https%3A%2F%2Fsubstack-post-media.s3.amazonaws.com%2Fpublic%2Fimages%2F538935d6-f85a-4da9-b27b-381879ee8558_1430x734.png 1456w" sizes="100vw" fetchpriority="high"></picture><div class="image-link-expand"><div class="pencraft pc-display-flex pc-gap-8 pc-reset"><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container restack-image"><svg role="img" width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>Bagchi et al. (2024) base their claims on an analysis of a <a href="https://solomonmg.github.io/pdf/Facebook_DP_URLs_Dataset.pdf">large dataset released by Meta</a> containing the amount of time users viewed, clicked, and shared URLs on Facebook. Using a list of outlets known for low-quality reporting (<a href="http://media bias fact check">Media Bias Fact Check</a>), they find a drop starting in early November 2020 that goes until early March 2021. These coincide with changes to the Facebook algorithm in November. Without such changes, Guess et al. 
<p>The authors responded with their own letter, hereinafter Guess et al. (2024). They make three key arguments against Bagchi et al. (2024), all of which I find reasonable:</p><ul><li><p>First, the experiment's internal validity is not affected by these changes: accurate causal conclusions can be reached for the specific period analyzed. Facebook did enact these changes, after all. </p></li><li><p>Second, the data analyzed by Bagchi et al. (2024) are incomplete: they only contain URLs shared more than 100 times, not the actual content posted on Facebook (which could itself be misleading). When analyzing Guess et al. (2023) data, they found little change in the number of unreliable sources pre- vs. post-study period. In other words, in their control group (where the recommender algorithm is enabled), they don&#8217;t observe this drop in the fraction of untrustworthy content. </p></li><li><p>Third, the evidence used by Bagchi et al. (2024) is not causal. The &#8220;information ecosystem&#8221; was absolutely crazy between November 2020 and March 2021. Observed changes might have come from the many things that were going on, e.g., Biden getting elected.&nbsp;</p></li></ul><h2>Things I wish the letters had done</h2><p>After reading each letter, I wished that more analyses had been conducted.&nbsp;</p><p>When reading Bagchi et al. (2024), I found in their references <a href="https://www.washingtonpost.com/documents/5bfed332-d350-47c0-8562-0137a4435c68.pdf#page=35">a fairly precise description of the algorithmic changes enacted by Facebook</a>, obtained by the January 6th Committee (but left out of its final report).
The most important ones:</p><blockquote><p><em><strong>(A)</strong> Filter low News Ecosystem Quality (NEQ) pages from Pages you May Like to prevent low quality and misinformation pages from becoming viral.</em></p><ul><li><p><em>Launched 10/22</em></p></li><li><p><em>Reduced to 75% on 12/1, then 50% on 12/3, 25% on 12/8, and deprecated on 12/10.</em></p></li><li><p><em>Relaunched in response to Jan 6th.</em></p></li></ul><p><em><strong>(B) </strong>Deploy the virality circuit breaker, which prevents the likelihood of URLs from new or unknown external domains that may contain misinformation from being boosted</em></p><ul><li><p><em>10/9 launched at 100x threshold</em></p></li><li><p><em>10/23 launched at 25x threshold</em></p></li><li><p><em>12/1 reduced to 75%, then 50% on 12/3, then 25% on 12/8</em></p></li><li><p><em>Deprecated on 12/10</em></p></li></ul><p><em><strong>(C)</strong> Demote content from users who posted multiple pieces of third-party fact-checked misinformation in the past 30 days.</em></p><ul><li><p><em>Launched 11/5</em></p></li><li><p><em>Reduced to 50% on 12/2, then deprecated 12/3</em></p></li><li><p><em>Relaunched 1/14</em></p></li><li><p><em>Deprecated 1/29.</em></p></li></ul><p><em><strong>(D)</strong> Demote low NEQ news and boost high NEQ news in order to increase the average quality of news in connected news feed</em></p><ul><li><p><em>Launched 11/7</em></p></li><li><p><em>Reduced to 75% on 12/1, then 50% on 12/3, 25% on 12/8, and deprecated 12/10</em></p></li><li><p><em>Relaunched 1/13</em></p></li><li><p><em>Deprecated 2/16</em></p></li></ul></blockquote><p>So, basically, two of the relevant changes (A and B) were deployed in October. This does not coincide with the sudden increase in trustworthy news in early November observed by Bagchi et al. (2024). The other two (C and D) were deployed in early November but were deprecated in December!
If these changes were driving the consumption of trustworthy news, I would expect consumption to have dropped again in December.</p>
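<p>To sanity-check this timeline argument, here is a small sketch that encodes the schedule quoted above and asks which measures were active over the course of the drop. I simplify each measure to its launch and full deprecation dates, ignoring the gradual ramp-downs, and I omit intervention A&#8217;s open-ended post-January-6th relaunch since no end date is given:</p><pre><code>from datetime import date

# (start, end) windows per intervention, taken from the schedule quoted above.
windows = {
    "A: filter low-NEQ pages":     [(date(2020, 10, 22), date(2020, 12, 10))],
    "B: virality circuit breaker": [(date(2020, 10, 9),  date(2020, 12, 10))],
    "C: demote repeat offenders":  [(date(2020, 11, 5),  date(2020, 12, 3)),
                                    (date(2021, 1, 14),  date(2021, 1, 29))],
    "D: demote low-NEQ news":      [(date(2020, 11, 7),  date(2020, 12, 10)),
                                    (date(2021, 1, 13),  date(2021, 2, 16))],
}

# The drop observed by Bagchi et al. runs from early Nov 2020 to early Mar 2021.
for probe in [date(2020, 11, 15), date(2020, 12, 15), date(2021, 1, 15),
              date(2021, 2, 15), date(2021, 3, 1)]:
    active = [name for name, spans in windows.items()
              if any(start &lt;= probe &lt;= end for start, end in spans)]
    print(probe, active or "none")  # mid-December and early March: none active</code></pre><p>The mismatch is visible right away: by mid-December no measure is active, yet the elevated consumption of trustworthy news persists.</p>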
width="20" height="20" viewBox="0 0 20 20" fill="none" stroke-width="1.5" stroke="var(--color-fg-primary)" stroke-linecap="round" stroke-linejoin="round" xmlns="http://www.w3.org/2000/svg"><g><title></title><path d="M2.53001 7.81595C3.49179 4.73911 6.43281 2.5 9.91173 2.5C13.1684 2.5 15.9537 4.46214 17.0852 7.23684L17.6179 8.67647M17.6179 8.67647L18.5002 4.26471M17.6179 8.67647L13.6473 6.91176M17.4995 12.1841C16.5378 15.2609 13.5967 17.5 10.1178 17.5C6.86118 17.5 4.07589 15.5379 2.94432 12.7632L2.41165 11.3235M2.41165 11.3235L1.5293 15.7353M2.41165 11.3235L6.38224 13.0882"></path></g></svg></button><button tabindex="0" type="button" class="pencraft pc-reset pencraft icon-container view-image"><svg xmlns="http://www.w3.org/2000/svg" width="20" height="20" viewBox="0 0 24 24" fill="none" stroke="currentColor" stroke-width="2" stroke-linecap="round" stroke-linejoin="round" class="lucide lucide-maximize2 lucide-maximize-2"><polyline points="15 3 21 3 21 9"></polyline><polyline points="9 21 3 21 3 15"></polyline><line x1="21" x2="14" y1="3" y2="10"></line><line x1="3" x2="10" y1="21" y2="14"></line></svg></button></div></div></div></a></figure></div><p>When reading Guess et al. (2024), I wondered why they wouldn&#8217;t simply rerun the analysis in Figure 1 (reproduced above), considering the (short) period <em>before</em> the changes to the algorithm were enacted. In other words, couldn&#8217;t they have redone the analyses considering the month <em>after</em> the start of the experiment (in late September) and <em>before</em>&nbsp;the changes pointed out by Bagchi et al. (2024)? This would have been a simple, elegant way to determine how much the algorithmic changes mattered.</p><h2>But what is external validity here, anyway?</h2><p>Guess et al. (2023) used pretty timid language to describe the findings, indicating that it was hard to generalize them. However, the lack of external validity is inherent to the problem of algorithmic effects! Let me explain. Implicit in both the letter by Bagchi et al. (2024) and in an editorial discussing the letter exchange is this idea of a &#8220;canonical&#8221; Facebook algorithm (the editorial calls it the &#8220;site&#8217;s default algorithm&#8221;). But Facebook has no default algorithm; the algorithm is constantly changing due to new data being posted on the website and new tweaks being made to the (dozens) of models used.&nbsp;</p><p>Bagchi et al. (2024) are correct in two senses. First, the original study by Guess et al. (2023) original study would have been even stronger if they studied the pre-changes algorithm. Second, the original should have mentioned the algorithmic changes made for the election. Their study would have been stronger if they had done the kinds of sanity checks they did in their letter, to begin with (and even stronger if they redid the analysis considering periods with different algorithmic &#8220;regimes&#8221;). However, they are misleading in calling these changes a threat to the study's overall validity. If anything, they are an incremental threat to its external validity, which was already questionable (by the authors themselves).</p><h2>So what?</h2><p>Outside the ivory tower, however, things get a bit weird. Meta&#8217;s Nick Clegg (President of Global Affairs) said that the original paper undermined claims that the site was designed to &#8220;serve people content that keeps them divided,&#8221; which is clearly false. The social dynamics induced by platforms like Facebook transcend their algorithm. 
In that context, I guess the letter may serve as a reminder that more work is needed to study the effects of social media (and of algorithms) on society.</p><p>But we must be careful! This letter and the reaction to it may also feed a misleading narrative. The <a href="https://www.ucd.ie/newsandopinion/news/2024/september/27/neweletterinsciencedebunksmeta-fundedstudysuggestingitsnews-feedalgorithmsarenotmajordriversofmisinformation/">press release</a> from University College Dublin had the sensationalist title &#8220;New eLetter in Science debunks Meta-funded study suggesting its news-feed algorithms are not major drivers of misinformation.&#8221; It also contains a quote from one of the letter&#8217;s authors that is simply not supported by the letters: &#8220;Our results show that social media companies can mitigate the spread of misinformation by modifying their algorithms but may not have financial incentives to do so.&#8221; This is false! The letter adds incremental concerns about the external validity of the study. That&#8217;s it.</p><p>One reasonable question to ask here is: what does the study (and the letter) tell us that will be helpful for the future? I share <a href="https://kevinmunger.substack.com/p/is-the-best-social-science-good-enough">Kevin Munger&#8217;s take</a> that we should think about these things in a Bayesian sense. The original Science study indicates that the Facebook algorithm will not be a key driver of polarization in the next election. The letter indicates that, in the absence of special changes, we might expect it to have a slightly bigger effect. In my view, the evidence of the Science study is stronger than that of the letter, given the response by Guess et al. (2024).</p>
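<p>For intuition only, here is a toy version of that Bayesian reading: treat each piece of evidence as a noisy Gaussian estimate of the algorithm&#8217;s effect and pool them by precision weighting. The numbers are made up for illustration; only the relative precisions matter:</p><pre><code># Toy precision-weighted pooling of two noisy estimates of an effect size.
def pool(estimates):
    # estimates: list of (mean, standard_error) pairs; precision = 1 / se^2
    weights = [1 / se ** 2 for _, se in estimates]
    return sum(w * m for (m, _), w in zip(estimates, weights)) / sum(weights)

science_study = (0.0, 0.5)  # made-up: a precise estimate of roughly no effect
eletter_hint  = (0.3, 1.5)  # made-up: a much noisier hint of a positive effect
print(pool([science_study, eletter_hint]))  # about 0.03: the posterior barely moves</code></pre>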
<h2><strong>Additional comments</strong></h2><p><a href="https://substack.com/profile/129436715-przemyslaw-grabowicz">Przemyslaw Grabowicz</a>, the corresponding author of the letter, wrote an interesting comment on this post. It is in the comments just below, but I also append it here:</p><blockquote><p>Manoel, I greatly appreciate your feedback, since it makes me realize that I should write more about our Science eLetter and its implications. Once I find time, I'll put together a proper overall response at <a href="http://uncommongood.substack.com/">UncommonGood.substack.com</a>. Meanwhile, let me quickly respond to the "three key arguments against Bagchi et al. (2024)" that you've taken from the response eLetter by Guess et al.</p><p>First, I don't think the paper by Guess et al. is internally valid. Consider the following. If a paper computes a causal estimate based on an experiment, but the control condition is meaningfully changed during the experiment specifically to affect the target causal estimand, and the paper doesn't reveal anything about that specific change, nor accounts for it, then the change could result in any desirable value of the causal estimate, without revealing anything about how it happened. In other words, a causal claim must define exactly what the control and treatment conditions are. If it doesn't, then the causal claim may be invalid in situations where the description misses something meaningful that's related to the estimand. If the change was not described, then the default assumption should be that there was no meaningful change during the experiment. However, during the experiment of Guess et al. there was a meaningful change introduced...</p><p>Second, you write that:</p><p>&gt; Guess et al. (2023) data, they found little change in the number of unreliable sources pre- vs. post-study period. In other words, in their control group (where the recommender algorithm is enabled), they don&#8217;t observe this drop in the fraction of untrustworthy content.</p><p>Ok, so let's see what exactly Guess et al. write in their response eLetter. I quote:</p><p>&gt; Over the 90 days prior to the treatment, untrustworthy sources represented 2.9% of all content seen by participants in the Algorithmic Feed (control) group &#8211; during the study period, this dropped only modestly to 2.6%.</p><p>So, according to their own measure, there was a drop in the fraction of misinformation from 2.9% to 2.6%, so that's a 10.5% relative drop (0.3/2.9), whereas we reported a 24% drop. Note, however, that only about half of their treatment period overlaps with the period of Facebook's emergency interventions. If it overlapped entirely, then, probably, instead of a 10.5% drop, we would observe a 21% drop. That starts to be quite close to the 24% drop we measured using a different dataset and a different notion of misinformation.</p><p>Third, you're right that the evidence used by Bagchi et al. (2024) is not causal. However, in our eLetter we haven't made any causal statements. Instead, we're pointing out that Guess et al. made causal statements without properly describing the control condition of their experiment. That said, we also provided potential explanations for the drop in the fraction of misinformation in users' news feeds. This explanation aligns with the reasons why the emergency measures were introduced.
These reasons were provided both officially by Facebook representatives [1], and unofficially by Facebook employees and a whistleblower, Frances Haugen [2, 3].</p><p>[1] <a href="https://www.nytimes.com/2020/12/16/technology/facebook-reverses-postelection-algorithm-changes-that-boosted-news-from-authoritative-sources.html">https://www.nytimes.com/2020/12/16/technology/facebook-reverses-postelection-algorithm-changes-that-boosted-news-from-authoritative-sources.html</a></p><p>[2] <a href="https://www.wsj.com/articles/the-facebook-files-11631713039">https://www.wsj.com/articles/the-facebook-files-11631713039</a></p><p>[3] <a href="https://www.washingtonpost.com/documents/5bfed332-d350-47c0-8562-0137a4435c68.pdf">https://www.washingtonpost.com/documents/5bfed332-d350-47c0-8562-0137a4435c68.pdf</a></p></blockquote><p>To which I replied:</p><blockquote><p>Thanks for the reply, I attached it to the end of the post!</p><p>I disagree with point #1: the control group was "Facebook as it was during the election," and that's fine. It is like saying that an experiment studying mobility in a city is invalid because there were changes due to Christmas &#8212; more likely than not, Facebook will always make changes for US elections... </p><p>I am not super convinced by point #2 either way. You make a good point about the "entire treatment period." But still, this is such a convoluted thing, because there are exogenous shocks to both the demand for and supply of news here. </p><p>I agree with you on point #3. You folks didn't make any causal claims in the letter, but note that &#8220;Our results show that social media companies can mitigate the spread of misinformation by modifying their algorithms but may not have financial incentives to do so&#8221; is a causal statement, which is what bothered me.</p></blockquote>]]></content:encoded></item><item><title><![CDATA[Discourse on social media is vibes-based]]></title><description><![CDATA[In the fall of 2021, The Wall Street Journal dropped a bombshell on Facebook (now Meta): &#8220;Facebook Knows Instagram Is Toxic for Teen Girls.&#8221; The report drew from &#8220;The Facebook Files,&#8221; documents leaked by former employee Frances Haugen, which, among other things, discussed internal research on mental health and well-being.]]></description><link>https://doomscrollingbabel.manoel.xyz/p/discourse-on-social-media-is-vibes</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/discourse-on-social-media-is-vibes</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Tue, 13 Aug 2024 14:45:53 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/1e12e395-f0f0-48e2-b805-21144b60bb7b_1024x1024.webp" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p>In the fall of 2021, The Wall Street Journal dropped a bombshell on Facebook (now Meta): <em>&#8220;<a href="https://www.wsj.com/articles/facebook-knows-instagram-is-toxic-for-teen-girls-company-documents-show-11631620739">Facebook Knows Instagram Is Toxic for Teen Girls</a>.&#8221;</em> The report drew from &#8220;The Facebook Files,&#8221; documents leaked by former employee Frances Haugen, which, among other things, discussed internal research on mental health and well-being.
One in five teenagers, the documents suggested, said Instagram made them feel worse about themselves (in a UK/US sample of around a thousand users).</p><p>Fast forward to the spring of 2024: Jonathan Haidt&#8217;s book &#8220;The Anxious Generation&#8221; made it to the top of the charts, sparking widespread conversation about the mental health crisis among young people. Haidt&#8217;s argument very much resonated with the internal research leaked by Haugen: there is a strong correlation between the increase in social media use and the worsening of mental health outcomes among teenagers. His conclusion: the brave new world of social media and mobile phones is harming teens&#8217; mental health.</p><p>Told in this fashion, one could believe that the reactions to these two notable events would be similar; after all, there is much in common between the arguments put forth by Haidt and by Meta&#8217;s internal researchers. Both point to a suggested negative impact of social media on teens&#8217; mental health, and, more importantly, both fall short of cleanly identifying a causal effect.&nbsp;</p><p>In the case of Facebook's internal documents, the data was entirely self-reported, which might have led participants to list Instagram as the culprit for their mental health issues simply because this notion is already widespread in pop culture (but not in academia). In Haidt&#8217;s book, conclusions are drawn from aggregated trends of tremendously complex outcomes, making it particularly hard to discern correlation from causation. All in all, in both cases, it may very well be that &#8220;<em>misery causes Facebook</em>&#8221; (this pearl is by Kevin Munger), and the studies are off.</p><p>However, the point is not to dunk on Facebook's internal documents or Haidt&#8217;s book&#8212;studying the causal effect of social media on well-being is remarkably challenging, and it is not as if they overlooked an obvious research design or method that would improve the credibility of their results. Rather, my point is to analyze the reaction to these two events. The conclusions drawn from &#8220;The Facebook Files&#8221; largely agree with the &#8220;Facebook Bad&#8221; trope created after the Cambridge Analytica scandal, and they drew little backlash for their limitations. On the other hand, Haidt&#8217;s book was met with extreme skepticism. For example, a <a href="https://www.nature.com/articles/d41586-024-00902-2">book review</a> in Nature by a leading psychologist in the area likens Haidt&#8217;s trend-line arguments to an exercise from the first day of a statistics course.</p><p>But why were the reactions so different? I hypothesize that it is all about <em>The Vibes</em>, a term coined by none other than the anxious generation. When The Wall Street Journal published its report, it did so against mounting skepticism and distrust toward Big Tech, particularly Facebook. In contrast, when assessing Haidt&#8217;s book, critics could not detach his argument from his position as a critic of progressive ideologies. Looking at no trend lines, I dare say that what caused the &#8220;differential criticism&#8221; was simply Haidt&#8217;s vibes.</p>
]]></content:encoded></item><item><title><![CDATA[Content Curation in Online Platforms]]></title><description><![CDATA[(This is a big rant on why research on content moderation, algorithms, and monetization strategies is hard and why we desperately need it.]]></description><link>https://doomscrollingbabel.manoel.xyz/p/content-curation-in-online-platforms</link><guid isPermaLink="false">https://doomscrollingbabel.manoel.xyz/p/content-curation-in-online-platforms</guid><dc:creator><![CDATA[Manoel Horta Ribeiro]]></dc:creator><pubDate>Sat, 06 Apr 2024 18:03:41 GMT</pubDate><enclosure url="https://substack-post-media.s3.amazonaws.com/public/images/8fbb97e0-4b34-4a61-bbf1-c5a31f835dba_1500x1008.jpeg" length="0" type="image/jpeg"/><content:encoded><![CDATA[<p><em>(This is a big rant on why research on content moderation, algorithms, and monetization strategies is hard and why we desperately need it. It is an interpolation between some of the materials I prepared for my job talk and my PhD thesis.)</em></p><div><hr></div><p>Online platforms like Facebook, Wikipedia, YouTube, Amazon, Uber, DoorDash, Airbnb, and Tinder have changed the world and become woven into the social fabric. It is hard to imagine our lives without them: our economies, our relationships, and how we acquire knowledge have become deeply connected to these platforms. The United Nations Conference on Trade and Development estimated that the global value of e-commerce sales reached almost <a href="https://unctad.org/publication/covid-19-and-e-commerce-global-review">26 trillion dollars in 2018</a>. Pew Research estimated that <a href="https://www.pewresearch.org/short-reads/2023/02/02/key-findings-about-online-dating-in-the-u-s/">around 10% of partnered US adults met their match through online dating as of 2023</a>. Wikipedia received over <a href="https://go.epfl.ch/wikipedia_stats">7 billion monthly visits</a> in 2023, satisfying users with the most diverse information needs. </p><p>Thus, it is perhaps not surprising that online platforms are also strongly connected to some of the most significant societal challenges of the 21st century. E-commerce platforms are responsible for <a href="https://pubs.acs.org/doi/10.1021/acs.est.2c00299">a sizeable chunk of greenhouse gas emissions</a>. Gig work platforms like Uber and DoorDash ignited discussions about workers' rights and precarious employment.
Radicalization and terrorism have become online-first phenomena: mainstream social media platforms like YouTube saw an influx of radical content that snowballed in popularity in the late 2010s, and fringe platforms like Gab and Parler were tightly associated with terrorist attacks and anti-democratic protests.</p><p>However, online platforms are not &#8220;immovable rocks&#8221; that we should accept as they are; they are sociotechnical systems where design choices, policies, and algorithms steer human behavior. And in the spirit of <a href="https://www.sfu.ca/~palys/Campbell-1991-MethodsForTheExperimentingSociety.pdf">Campbell's experimenting society</a>, we should propose and assess ways of improving online platforms, maximizing their benefits and minimizing their harms. </p><p>The core enterprise of online platforms is to <em>curate content</em>. Users upload images, list products to sell, and create profiles; platforms curate this content and serve it to other users. They do so in a few critical ways: deciding what to recommend to users, how to monetize content, and what is allowed on the platform.</p><p>Content curation practices have captured the imagination of journalists, politicians, and the general public. For example, tech CEOs testified in Senate Hearings in <a href="https://www.washingtonpost.com/news/the-switch/wp/2018/04/10/transcript-of-mark-zuckerbergs-senate-hearing/">2018</a>, <a href="https://www.npr.org/2020/10/28/928532702/facebook-twitter-google-ceos-testify-to-senate-what-to-watch-for">2020</a>, <a href="https://www.npr.org/2021/03/25/980510388/facebook-twitter-google-ceos-testify-before-congress-4-things-to-know">2021</a>, and <a href="https://apnews.com/article/meta-tiktok-snap-discord-zuckerberg-testify-senate-00754a6bea92aaad62585ed55f219932">2024</a>; they were often asked to discuss content curation practices like recommender systems or content moderation. In a 2018 poll, <a href="https://www.cnbc.com/2018/09/06/twitter-permanently-bans-alex-jones-and-infowars-accounts.html">65% of self-described US conservatives thought social media platforms were censoring conservative ideas</a>. In that context, research informing content curation practices is <em>actionable</em>: it can propose concrete ways of changing online platforms and inform stakeholders. For example, deplatforming, or banning individuals or collectives from our online ecosystem, has <a href="http://theconversation.com/banning-disruptive-online-groups-is-a-game-of-whac-a-mole-that-web-giants-just-wont-win-154427">been</a> <a href="http://theconversation.com/does-deplatforming-work-to-curb-hate-speech-and-calls-for-violence-3-experts-in-online-communications-weigh-in-153177">widely</a> <a href="http://theconversation.com/deplatforming-online-extremists-reduces-their-followers-but-theres-a-price-188674">debated</a>, as the intervention treads a thin line between preventing harm and censoring speech. How to weigh the benefits and harms of these practices? How to assess whether they are effective? Enter research on content curation.</p><p>Despite this promise, however, research on content curation practices has arguably failed to drive their development and adjustment. Compared to other (polarizing) topics like healthcare or labor, the debate around content curation in social media is disproportionately driven by anecdotal or observational evidence that describes problems, but not solutions.
This is (at least partly) because researching content curation practices is damn hard:</p><ul><li><p>Content curation practices are opaque and carried out by private companies at their discretion;</p></li><li><p>Researchers often lack access to data or the necessary experimentation infrastructure;</p></li><li><p>Online platforms are highly dynamic, raising concerns about the <a href="https://kevinmunger.substack.com/p/temporal-validity-is-distinct-from">temporal validity of findings</a>;</p></li><li><p>And, in some cases, disentangling the effect of content curation practices is methodologically challenging; e.g., <a href="https://dl.acm.org/doi/abs/10.1145/3351095.3372879">see</a> <a href="https://www.pnas.org/doi/abs/10.1073/pnas.2101967118">the</a> <a href="https://www.pnas.org/doi/10.1073/pnas.2101967118">literature</a> on the effects of the YouTube algorithm.</p></li></ul><p>Yet, we should not give up! Causal evidence on content curation practices is hard to produce but can yield significant payoffs: given how widely used online platforms are, marginal improvements to online environments are meaningful; and given the wide appetite for regulating online platforms, research can guide policy away from guesswork.</p>]]></content:encoded></item></channel></rss>