The Empiricism Gap in Computer Science
So far, I have argued that there is a dissonance between CS’s founding myths, curricula, and self-image on the one hand, and how knowledge is actually produced in the field today on the other.
Here, I want to name the dominant style of empiricism that has emerged from that reality, particularly in machine learning. For lack of a better term, I’ll call it build-and-test empiricism. As the name suggests, this flavor of empiricism treats construction as the primary research method: you build a model, a system, or an algorithm, run it, measure the outcomes, and iterate. It works best when the field shares a common yardstick (benchmarks, leaderboards, standardized testbeds) because then disagreement collapses into a simple question: which approach performs better under the same evaluation?
In this modus operandi, knowledge accumulates through artifacts and evidence: a working construction, together with empirical performance results that others can reproduce, compare, and build on. A canonical example is the ImageNet Large Scale Visual Recognition Challenge (ILSVRC), which turned image classification into a shared yardstick: everyone trained on the same large dataset and compared on the same evaluation metric. When Krizhevsky, Sutskever, and Hinton entered AlexNet in ILSVRC 2012, they didn’t just propose “a good idea”: they demonstrated its value in a way anyone could check, by beating every competing entry on the shared metric by a wide margin.
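To make the loop concrete, here is a minimal sketch of build-and-test in code; every name in it (train_fn, metric, the splits) is a hypothetical placeholder for whatever a given benchmark supplies, not a reference to any real library:

```python
# A minimal sketch of the build-and-test loop under a shared yardstick.
# All names are placeholders: `train_fn` builds an artifact from a config,
# and `metric` is the community's agreed-upon score (e.g., top-1 accuracy).

def evaluate(model, test_set, metric):
    """Score a model on the frozen test split with the shared metric."""
    xs, ys = zip(*test_set)
    return metric([model(x) for x in xs], list(ys))

def build_and_test(configs, train_set, test_set, train_fn, metric):
    """Try design choices; keep whatever scores best on the yardstick."""
    best_score, best_model = float("-inf"), None
    for config in configs:                    # architectures, hyperparameters, tricks
        model = train_fn(train_set, config)   # build the artifact
        score = evaluate(model, test_set, metric)
        if score > best_score:                # "better" just means a higher number
            best_score, best_model = score, model
    return best_model, best_score             # the artifact plus the number you report
```

The sketch also shows what the style takes for granted: as long as everyone holds test_set and metric fixed, disagreement reduces to comparing the returned scores.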
I’ll contrast this with another flavor of empiricism, prevalent in political science, economics, and epidemiology. Again, for lack of a better term, I’ll call it describe-and-defend empiricism. This flavor of empiricism treats inference as the first-class citizen: you formulate a claim about the world, assemble evidence, and argue why the data support that claim. It works best when the field shares a common language for what counts as evidence—identification, counterfactual reasoning, and uncertainty—because then disagreement collapses into a more straightforward question: is this claim credible given the design and the assumptions?
In this modus operandi, knowledge accumulates through the refinement and defense of generalizable statements about reality. A canonical example is the minimum-wage debate. Instead of “does this model beat that model on ImageNet?”, the shared question became a claim about the world: do minimum-wage increases reduce employment? Card and Krueger’s famous New Jersey–Pennsylvania study (1994) framed the problem as a quasi-experiment: New Jersey raised its minimum wage in 1992, Pennsylvania did not, so one can compare employment changes in fast-food restaurants across the border before and after the policy change (a classic difference-in-differences design). Their headline result challenged the textbook prediction, sparking years of dispute not only about the answer but also about whether the identification strategy, measurement, and data sources justified the claim.
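For readers who have not seen the design before, the difference-in-differences logic fits on one line; this is a simplified sketch using the two states as labels, not Card and Krueger’s exact specification:

```latex
% Difference-in-differences: the change in the treated state minus the change
% in the control state, which nets out shocks common to both states.
\widehat{\tau}_{\mathrm{DD}}
  = \left(\bar{Y}_{\mathrm{NJ},\,\mathrm{post}} - \bar{Y}_{\mathrm{NJ},\,\mathrm{pre}}\right)
  - \left(\bar{Y}_{\mathrm{PA},\,\mathrm{post}} - \bar{Y}_{\mathrm{PA},\,\mathrm{pre}}\right)
```

Each term is average fast-food employment in the given state and period, and the estimate is only as credible as the parallel-trends assumption behind it: absent the wage increase, New Jersey’s employment would have moved like Pennsylvania’s.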
That is describe-and-defend empiricism in action: the conversation quickly moved to critiques, reanalyses, and replies (e.g., Neumark & Wascher’s comment and Card & Krueger’s reply in the AER), because the community is built around adversarial scrutiny of credibility. The debate then generalized to broader designs, such as using discontinuities at state borders across multiple policy changes, to test whether the claim holds up across settings.
Maintenance Layer
A helpful way to draw the contrast between these flavors of empiricism more sharply is to separate the engine of each style from its maintenance layer. In other words, once a community has a means of generating results, how does it keep itself honest? What are the routines and institutions that notice when the engine starts producing nonsense: when results don’t replicate, when benchmarks get gamed, when claims are overstated, or when an entire methodological approach turns out to be flawed? The maintenance layer is the answer: the field’s error-correction machinery, the set of practices that turns “a pile of results” into something closer to cumulative knowledge.
As we’ve mentioned, build-and-test empiricism has an obvious engine: artifacts plus shared yardsticks. But shared yardsticks come with predictable pathologies. Because benchmarks are finite and public, it is easy to overfit not only models to data but scientists to metrics: countless small choices (hyperparameters, prompt formats, data filtering, selection of seeds) get tuned until they look good on this evaluation. Reproducibility also becomes fragile: in modern ML, results can turn on details that are hard to fully specify: hardware, distributed training quirks, nondeterminism, dataset versions, and preprocessing pipelines. And even when a paper’s headline number is “real,” it can be a narrow kind of reality: a win that evaporates under distribution shift, under a slightly different label definition, or in a deployment setting that wasn’t represented in the benchmark.
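To see the seed problem concretely, here is a toy sketch (all numbers invented) contrasting the score you get by reporting the best seed with the distribution across seeds:

```python
# A toy illustration of the seed-selection pathology: the same method, reported
# via its best seed versus across all seeds. All numbers here are invented.
import random
import statistics

def run_experiment(seed):
    """Stand-in for "train and evaluate once": noisy accuracy around 0.80."""
    rng = random.Random(seed)
    return 0.80 + rng.gauss(0, 0.01)

scores = [run_experiment(seed) for seed in range(10)]

print(f"best seed:       {max(scores):.3f}")   # the number that tends to get reported
print(f"mean over seeds: {statistics.mean(scores):.3f} "
      f"+/- {statistics.stdev(scores):.3f}")   # the number that describes the method
```

Nothing in the loop stops an author from reporting only the first line; whether the second line appears is exactly the kind of thing the maintenance layer has to enforce.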
The community’s response is often to patch the process with rituals—lightweight forms of quality control meant to raise the floor without slowing the iteration loop too much. Some rituals target reporting (checklists; standard tables and ablations), some target auditability (artifact evaluation and badging; code and data release norms), and some target interpretability and scope (model cards, datasheets, documentation templates). None of these rituals “solves” the epistemology of generalization. But they are a pragmatic maintenance strategy: a way to keep the engine running while limiting the most common failure modes, and to make it harder for the field to confuse a brittle benchmark win with a robust piece of knowledge.
Describe-and-defend empiricism has a different engine: claims anchored in an argument about why the data identify what you say they identify. Its maintenance layer looks different, too, because the risks are not “overfitting the benchmark” so much as fooling ourselves with confounding, researcher discretion, or fragile specifications. The field therefore expends considerable effort on robustness: alternative specifications, placebos, sensitivity analyses, replication across contexts, and explicit uncertainty quantification. Rituals exist here too (pre-analysis plans, preregistration, replication packages), but they are downstream of a deeper norm: you do not get to make a strong claim unless you can defend it.
There is no better way to see the difference between the two maintenance layers than to observe how each ecosystem responds when it discovers a methodological error. In econometrics, the recent “difference-in-differences reckoning” is a good example: researchers realized that the vanilla two-way fixed effects DiD regression behaves badly under staggered adoption and heterogeneous treatment effects (Goodman-Bacon, 2021). In plain English, people had been doing bad stats for years. The response was not to “use a checklist next time.” Instead, the field produced new estimators, review papers with concrete recommendations, and, crucially, a wave of re-analyses: applied papers were re-estimated with updated methods to see which conclusions survived (e.g., Nagengast and Yotov, 2025).
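For concreteness, the regression at the heart of that reckoning is the canonical two-way fixed effects specification; this is the textbook form, not any particular paper’s model:

```latex
% Two-way fixed effects DiD: unit effects, period effects, and a single
% treatment coefficient meant to summarize the effect.
Y_{it} = \alpha_i + \lambda_t + \beta^{\mathrm{DD}} D_{it} + \varepsilon_{it}
```

Goodman-Bacon’s decomposition shows that, under staggered adoption, the estimated coefficient is a weighted average of every two-group, two-period comparison in the data, including comparisons that use already-treated units as controls; once treatment effects vary over time, those comparisons contaminate the estimate, and the single reported number can land far from any effect anyone actually cares about.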
Machine learning handles “mistakes” differently. For example, when a benchmark is found to be leaky, saturated, or overly gameable, the community typically shifts the goalposts rather than revisiting the back catalog. Consider ImageNet: when Recht et al. (2019) built fresh test sets, they found nontrivial accuracy drops, evidence that the community had partially overfit to a heavily reused benchmark. The typical response was not to issue corrigenda for years of ImageNet papers. Instead, the field continued to iterate on the next yardstick. Indeed, by the time Recht et al. published their paper, computer vision’s center of gravity had already shifted from image classification to tasks such as detection, segmentation, and captioning, and to datasets such as MS COCO. Old papers remain “fine,” not because their conclusions are timeless, but because they were produced under the rituals that were locally acceptable at the time. The maintenance layer is oriented toward “patch-and-proceed” rather than “re-estimate-and-reckon”.
Why build-and-test empiricism is not enough
This piece is a critique of build-and-test empiricism. But it is worth steelmanning before criticizing it. In many ways, the progress in ML, and in CS more generally, is the envy of other disciplines. Build-and-test empiricism creates fast feedback loops, rewards concrete contributions, and makes knowledge unusually transmissible. Under that regime, research “compounds”: the community shares datasets and code, converges on testbeds, and iterates quickly enough to discover regimes that theory did not predict. Moritz Hardt put this better than I ever could in his new book on ML benchmarks: “researchers might’ve indeed dismissed the idea of benchmarking as unreasonable, if it hadn’t been so successful.”
So what’s the problem? Especially outside of machine learning, CS increasingly studies domains that lack a benchmark to guide iterative development, and many of the most critical questions are explicitly causal: Does this interface change behavior? Does this policy reduce harm? Does this intervention improve well-being? When build-and-test empiricism loses its yardstick and lacks a shared causal language, it becomes easy to produce papers that look empirical but do not cumulate. It is unclear what claim was actually identified, and unclear what a “better” result should even mean.
The deeper issue is that we often keep a build-and-test mentality even when the object of study has shifted from artifacts to claims. In ML, a narrow win can be “fine” because the claim is narrow: this method improves this metric under this protocol. But in HCI, privacy, and policy-adjacent CS, the implied claim is usually broader: that some design choice will change how people behave, or that an intervention will improve real outcomes. Those are high-stakes claims—and yet our field rarely treats them with the institutional seriousness that claim-centered disciplines have built up over decades: routine re-analyses when a standard method is criticized, explicit norms for what constitutes robustness, or a culture in which the community revisits and revises empirical conclusions.
This gap is most evident in the genre expectations of our papers. The typical “implications” section in CHI is written as if it were a policy memo, even when the evidentiary base is modest: a lab study with a convenience sample, a deployment in a single setting, or interviews with a dozen participants. None of these is an invalid method. But they support a particular kind of claim, under particular scope conditions. When we glide from “here is what we observed in this context” to “here is how platforms serving millions should behave,” we invite exactly the kind of scrutiny that economics seminars institutionalize: what is the counterfactual, what is the estimand, what is the uncertainty, and what would change your conclusion? This mismatch between our ambition and the discipline of our inferential machinery undermines CS’s ability to shape the real world and weakens its credibility with policymakers and other audiences trained to ask such questions.
But the gap runs even deeper, and there are now problems even in the most “yardstickable” subfields. The trouble is that the same properties that make this style powerful also make it easy to mistake local wins for general truths. A benchmark score indicates that a method performed well under a specific evaluation protocol. When the field treats “state of the art” as synonymous with “we now understand the phenomenon,” it blurs the distinction between “works here” and “works in general,” and between “what works” and “what is true.” Worse, the benchmark can quietly become the target: we optimize for the metric, and then retrofit narratives about the underlying capability.
If you want examples, look no further than the COVID craze: the few months when researchers from every field (myself included) abandoned whatever they were doing to write “a COVID-related paper.” In ML, the result was a flood of models with glossy AUCs and clean-looking tables, but surprisingly little durable knowledge. Roberts et al. (2021) argued that much of the literature suffered from predictable methodological pitfalls: dataset bias, leakage, non-representative sampling, and weak evaluation protocols. These were severe enough that, in their study, none of the 2,212 models proposed to detect or prognosticate COVID-19 from chest radiographs and CT scans were considered potentially clinically useful. Take a second to consider this. It is just insane. 2,212 papers manufactured their own “local” wins on benchmarks of their own making while losing sight of the real problem they were supposed to be solving.
In this case, the “yardstick” is still somewhat crisp, and you could imagine a responsible researcher engaging more seriously with the literature and producing something worthwhile. Yet machine learning (and computer science more generally) is increasingly encroaching on domains where the yardstick is no longer crisp, and where there is not a single yardstick but several. Generative AI has dragged the field toward tasks that are more abstract, open-ended, and value-laden: helpfulness, harmlessness, truthfulness, creativity, persuasion, alignment, “good judgment.”
As a result, the build-and-test engine begins to wobble: we still want shared yardsticks, but the target is moving, and the metric often serves as a proxy for something we cannot directly observe. The field responds in its characteristic way, by building new benchmarks, rubrics, and evaluator models. Still, under this new regime, it is much easier for “what scores well” to drift away from “what is actually good,” and much harder to know whether an apparent improvement reflects a real capability gain or simply better exploitation of the measurement procedure.
Why can’t we do both?
This is where the describe-and-defend approach to empiricism offers an essential lesson for CS. It provides a shared language for what a claim entails: its assumptions, counterfactuals, uncertainty, and scope. Without that language, the community tends to reach for rituals as a substitute for reasoning: more checklists, more templates, more badges. These rituals often help. But they are a thin proxy for the greater skill of asking: what exactly does this result establish, and what would it take for it to stop being true?
A recent position paper by Wallach et al. (2025) makes this point, aimed squarely at the GenAI moment: evaluating generative AI systems is a social science measurement challenge. They argue that many GenAI evaluations amount to “apples-to-oranges comparisons,” and that the ML community would benefit from importing measurement theory: being explicit about what construct you claim to measure (capabilities, behaviors, impacts), how your instrument operationalizes it, and what validity evidence would justify treating the result as knowledge rather than noise.
But before we get too trigger-happy about the conclusion that “we should all be social scientists now,” it is worth scrutinizing describe-and-defend empiricism for the messy beast it also is. For one, it is often slow. When the unit of progress is a defensible claim about the world, you pay for credibility in calendar time: collecting better data, negotiating access, validating measures, and stress-testing identification strategies. And even then, the best you can sometimes do is a careful estimate under plausible assumptions, never a fully proven claim.
Worse, describe-and-defend carries its own failure modes. When the reward is a clean claim, it can incentivize overconfident narratives, specification search, and knife-edge identification strategies that look airtight on paper but are fragile in practice. Robustness checks help, but they can also become rituals of their own, performed because reviewers expect them, not because they genuinely change anyone’s mind. Even credible causal estimates can be narrow: a precisely defended effect in one setting that is difficult to generalize, interpret mechanistically, or translate into design guidance.
The punchline is not “CS should become economics” or “every paper needs a preregistration.” The punchline is that CS needs to become bilingual. We should keep build-and-test empiricism—its speed, openness, and compounding progress are real achievements—but we should supplement it with a second language of empiricism that lets us state and evaluate claims with more precision. That language helps us separate exploration from confirmation, local wins from generalization, and performance improvements from causal explanation. In other words: keep the engine, but upgrade the epistemology.
What would this look like in practice? It starts with teaching a small set of concepts that claim-centered fields treat as basic literacy: what is the estimand we are trying to learn, what is the counterfactual, what assumptions connect the data we have to the claim we want, what are the threats to identification, what uncertainty do we actually have, and what robustness checks would meaningfully stress the conclusion. It also means taking measurement more seriously in settings where CS is rapidly expanding (especially in generative AI) by being explicit about constructs, validity, and scope.
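As one concrete piece of that vocabulary, here is the most common estimand in this family written in standard potential-outcomes notation; this is a textbook formulation, not tied to any particular CS setting:

```latex
% The estimand: the average treatment effect over potential outcomes Y(1) and
% Y(0), which are never both observed for the same unit (the counterfactual).
\tau_{\mathrm{ATE}} = \mathbb{E}\left[\,Y(1) - Y(0)\,\right]

% One bridge from observable data to that estimand: if treatment D is as good
% as random given covariates X (unconfoundedness, plus overlap), then
\tau_{\mathrm{ATE}} = \mathbb{E}_{X}\left[\,\mathbb{E}[Y \mid X, D = 1] - \mathbb{E}[Y \mid X, D = 0]\,\right]
```

Writing a claim this way makes the rest of the checklist unavoidable: the counterfactual is the potential outcome you never observe, the identifying assumption is the bridge in the second line, threats to identification are the ways that bridge can fail, and robustness checks are whatever meaningfully probes those failure modes.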
Finally, it requires a cultural shift in the rewards we value. Not every paper needs to make a sweeping claim; sometimes the honest contribution is a careful measurement instrument, a negative result, or a boundary condition that shows where an approach breaks. And when a methodology is shown to be flawed, we should normalize reanalysis as part of the scientific process. If build-and-test is how CS moves fast, then principled inference is how it learns what it has actually learned.

