Henderson’s first law of econometrics reads:
When you read an econometric study done after 2005, the probability that the researcher has failed to take into account an objection that a non-economist will think of is close to zero. (Source)
I dare to propose a similar law for machine learning:
When an economist reads (and understands) an empirical machine learning study done after 2022, the probability that they will think of an objection that the researcher has failed to take into account is close to one.
And why is that? Because the two fields treat empiricism in opposite ways. Econometrics was forged in the crucible of skepticism. Every paper is a defensive war against omitted variables, selection bias, and endogeneity. Its practitioners have been trained to see identification not as a formality, but as a matter of survival. Sit in on one of their seminars, and you’ll witness the kind of interrogation that, in Computer Science, would pass for an ambush.
In machine learning, by contrast, the prevailing norm is demonstration, not falsification. Success is typically measured by predictive performance on a benchmark, rather than by causal clarity or robustness to alternative specifications. At the same time, papers in machine learning increasingly treat models as behavioral systems to be studied experimentally (“Can Large Language Models…”). These papers make causal claims about systems—about how model behavior changes with prompts, environments, or training regimes—yet they engage only timidly with the identification concerns that would accompany such claims in the empirical sciences.
This is not a machine learning problem alone—it permeates Computer Science, from HCI to Security and Privacy. Perhaps the reason lies in the field’s origins: a discipline born from the union of system building and theorem proving, where progress is demonstrated through construction, not systematic observation.
So what? I don’t envy economists who spend five years publishing a paper. But I do believe Computer Science would benefit from adopting more of the empirical culture found in disciplines like economics and political science. The easiest place to start is education. Next semester, I’ll be teaching a graduate seminar titled Empirical Research Methods for Computer Science, partly because I want my students to have these tools—and partly because I want to ask a deeper question: what should the core empirics curriculum of Computer Science look like?
Below is my sketch, which I've spent a week working on.
First, we need to provide students with a vocabulary for causality. Many of the questions that computer scientists now confront—Does fine-tuning a model change its fairness? What is the effect of personalization on engagement?—are fundamentally causal questions. Yet few students have the conceptual tools to pose these questions precisely, let alone to answer them. Our students need their own version of the causal-inference toolkit, one focused on how to design credible experiments, reason about identification, and interpret evidence with humility.
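To make the point concrete, here is a minimal simulation, with entirely hypothetical variables and numbers, of the kind of exercise such a course could open with: a confounder drives both treatment and outcome, the naive comparison of means is badly biased, and a simple backdoor adjustment recovers the true effect.

```python
# Hypothetical illustration: a confounder Z drives both treatment T and
# outcome Y, so the naive difference in means overstates the true effect.
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
true_effect = 1.0

z = rng.normal(size=n)                                # confounder (e.g., user activity)
t = (rng.normal(size=n) + z > 0).astype(float)        # treatment assignment depends on z
y = true_effect * t + 2.0 * z + rng.normal(size=n)    # outcome depends on t and z

# Naive estimate: difference in mean outcomes between treated and untreated.
naive = y[t == 1].mean() - y[t == 0].mean()

# Adjusted estimate: regress y on t and z (backdoor adjustment via OLS).
X = np.column_stack([np.ones(n), t, z])
beta = np.linalg.lstsq(X, y, rcond=None)[0]

print(f"true effect:     {true_effect:.2f}")
print(f"naive estimate:  {naive:.2f}")    # biased upward by the confounder
print(f"adjusted (OLS):  {beta[1]:.2f}")  # close to the true effect
```

The code itself is trivial; the vocabulary is the point. Without words like confounder, identification, and adjustment, a student has no way to say why the first number is wrong and the second one is not.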
Second, we need to teach students about regression. But not as a tool for prediction, as it is taught in machine learning courses. We need to teach them regression as a tool for inference: a way to estimate relationships, test hypotheses, and reason about uncertainty. Computer Science students rarely learn to read a regression table, let alone to interrogate one. Yet these habits—thinking in terms of confounders, robustness checks, and sensitivity analyses—are what make empirical results interpretable and cumulative.
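As a sketch of what reading a regression table means in practice (synthetic data, hypothetical variable names, statsmodels for estimation), the object of interest is the coefficient, its standard error, and its confidence interval, not held-out accuracy:

```python
# Hypothetical example: regression as inference rather than prediction.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 5_000
experience = rng.exponential(scale=5, size=n)   # years of experience
treatment = rng.binomial(1, 0.5, size=n)        # e.g., a tooling intervention
latency = 200 - 8 * treatment - 3 * experience + rng.normal(scale=20, size=n)

X = sm.add_constant(np.column_stack([treatment, experience]))
fit = sm.OLS(latency, X).fit(cov_type="HC1")    # heteroskedasticity-robust SEs

# The habit to build: read the coefficient on `treatment`, its standard
# error, and its confidence interval; then ask which confounders the
# specification omits and how sensitive the estimate is to them.
print(fit.summary(xname=["const", "treatment", "experience"]))
print(fit.conf_int()[1])                        # 95% CI for the treatment effect
```

A student who can interrogate this output, rather than just report it, is already doing a different kind of empiricism than leaderboard chasing.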
Third, we need to teach students about benchmarks. Benchmarks have long served as the de facto empirical infrastructure of Computer Science. Yet few students are ever taught to think critically about them. What does a benchmark actually measure? When does improving performance reflect genuine scientific insight, and when does it reflect overfitting to a proxy task? Treating benchmarking as a scientific problem in its own right—concerned with validity, reliability, and construct design—would help students see that measurement is a form of theorizing about the world.
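One small exercise in this spirit, using synthetic per-item scores: bootstrap the benchmark's test set to see how much of a leaderboard gap is within sampling noise, before even asking what the benchmark measures.

```python
# Sketch (synthetic data): a benchmark score is an estimate, not a fact.
# Bootstrapping the test set shows how much of a leaderboard gap is noise.
import numpy as np

rng = np.random.default_rng(2)
n_items = 1_000  # size of the (hypothetical) benchmark test set

# Per-item correctness for two hypothetical models with similar true accuracy.
model_a = rng.binomial(1, 0.81, size=n_items)
model_b = rng.binomial(1, 0.80, size=n_items)

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05):
    """Percentile bootstrap CI for mean accuracy over test items."""
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    means = scores[idx].mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

print("model A:", model_a.mean(), bootstrap_ci(model_a))
print("model B:", model_b.mean(), bootstrap_ci(model_b))
# If the intervals overlap heavily, a one-point "improvement" on this
# benchmark says little about genuine progress.
```

Reliability is only the first question; the harder one, validity, asks whether accuracy on these items is a reasonable proxy for the capability we actually care about.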
Fourth, we need to teach students about experimental design as it is practiced in Computer Science today. Our field now runs experiments at a scale and speed unimaginable in most other sciences, yet the principles remain the same: randomization, power, validity, and ethics. Students should learn how to design credible experiments, estimate statistical power, and ensure data quality in online settings—from platforms like Prolific to in-product A/B tests. These practices are not bureaucratic hurdles; they are what make empirical claims reproducible and trustworthy.
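For example, a back-of-the-envelope power calculation for a hypothetical A/B test, with an assumed baseline conversion rate and minimum detectable effect and statsmodels doing the arithmetic, might look like this:

```python
# Sketch: how many users per arm before launching a (hypothetical) A/B test?
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

baseline = 0.10   # assumed control conversion rate
mde = 0.01        # smallest lift worth detecting (10% -> 11%)

effect = proportion_effectsize(baseline + mde, baseline)  # Cohen's h
n_per_arm = NormalIndPower().solve_power(
    effect_size=effect, alpha=0.05, power=0.8, alternative="two-sided"
)
print(f"~{n_per_arm:,.0f} users per arm")  # roughly 7,400 under these assumptions
```

Running this before collecting data, rather than rationalizing a sample size afterward, is exactly the kind of habit the course should instill.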
None of these things is “hard,” and good material exists for a lot of them. Brady Neal has a great course on causal inference; Moritz Hardt has an upcoming book on benchmarks; and some recent Economics textbooks provide a very gentle introduction to causality and quasi-experimental methods. Yet for empirical rigor to take root in Computer Science, it needs to become part of how we train students to think—not just a set of optional skills. We need to signal that understanding identification, regression, and experimental design is as fundamental to being a computer scientist as knowing how to optimize an algorithm or prove a theorem. Only then can our field move from demonstration to explanation, and from performance to understanding.