Labeling Data with Language Models: Trick or Treat?
Large language models are now labeling data for us. In dozens of new studies, researchers have quietly replaced undergrad coders and Mechanical Turkers with GPT-4 or Gemini 2.5. What began as a convenience is fast becoming a methodological revolution, one that could reshape how social scientists measure the world.
As I’ve been catching up with the literature, I see three concurrent “waves” of work around this idea.
Early papers were awed by LMs’ annotation abilities. Back in 2023, Gilardi et al. showed that “ChatGPT outperforms crowd workers for several annotation tasks, including relevance, stance, topics, and frame detection.” And it did so at a fraction of the cost—roughly thirty times cheaper than human annotators. Similar results quickly followed, suggesting that LMs might transform quantitative social science (see Törnberg, or this recent tweet). In ML and NLP circles, this trend became known as LM-as-a-judge (Zheng et al.; Kim et al.).
The next wave asked a harder question: how do we do this right? Ziems et al. systematically analyzed LM annotations across models and tasks, producing general usage guidelines. Others accepted that LM labels will never be perfect and proposed ways to account for their bias. Two go-to methods here are DSL by Egami et al. and CDI by Gligorić et al. The basic idea of both approaches is to use a smaller set of human annotations to “ground” the LM-generated labels: once you model the LM’s bias relative to the human sample, you can correct your final estimates. Calderon et al. go a slightly different route. They use a sample of data annotated multiple times in an “alternative annotator test,” which justifies swapping in an LM when it aligns with the annotators’ majority vote at least as well as the individual human annotators do.
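To give a flavor of the correction step, here is a toy sketch of the general “ground with a gold subset” idea. It is not the actual DSL or CDI estimator; the function name and the simulated numbers are mine. Suppose we want the share of documents with some property, we have LM labels for every document, and we hand-labeled a small random subset:

```python
import numpy as np

def debiased_mean(lm_labels, gold_labels, gold_mask, sampling_prob):
    """Toy design-based correction: combine cheap LM labels with a small,
    randomly sampled human-labeled ("gold") subset to get an approximately
    unbiased estimate of the population mean."""
    lm_labels = np.asarray(lm_labels, dtype=float)
    correction = np.zeros_like(lm_labels)
    # On the gold subset, measure how far the LM label is from the human label,
    # reweighted by the inverse probability of being hand-labeled.
    correction[gold_mask] = (gold_labels[gold_mask] - lm_labels[gold_mask]) / sampling_prob
    pseudo = lm_labels + correction              # bias-corrected pseudo-outcomes
    return pseudo.mean(), pseudo.std(ddof=1) / np.sqrt(len(pseudo))

# Simulated data: the LM systematically over-predicts the positive class.
rng = np.random.default_rng(0)
truth = rng.binomial(1, 0.30, size=10_000)            # true prevalence is 0.30
lm = np.where(rng.random(10_000) < 0.15, 1, truth)    # LM flips some labels to 1
gold_mask = rng.random(10_000) < 0.05                  # hand-label ~5% at random
gold = np.where(gold_mask, truth, np.nan)              # humans only label that subset

print("naive LM estimate:", lm.mean())                 # biased upward, ~0.40
print("corrected estimate, SE:", debiased_mean(lm, gold, gold_mask, 0.05))
```

With these made-up numbers, the naive average of LM labels lands around 0.40 even though the true prevalence is 0.30; the corrected estimate recovers roughly 0.30, at the cost of a wider confidence interval.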
Finally, a “third wave” of work is now raising the alarm that LMs-as-annotators might lead to bad science. In perhaps my favorite paper of this wave, Barrie et al. show that even when holding temperature fixed, using proprietary models as annotators leads to unacceptably high variance in performance over the course of many months. Yang et al. replicated 14 recently published papers and found that “LLM annotations have low intercoder reliability with the original annotations and moderate reliability among the LLMs themselves.” Baumann et al. find that conclusions drawn from LLM annotations often differ from those drawn from human annotations, depending on the LLM used and its configuration. They define the term “LM hacking” as mistakes that “propagate to downstream analyses and cause Type I (false positive), Type II (false negative), Type S (wrong sign), or Type M (exaggerated effect) errors.”
Okay, so where do we go from here?
First, we need to clarify what problem we’re actually solving.
For example, the aforementioned LM hacking paper by Baumann et al. groups all errors under the same umbrella; e.g., they count not finding an effect when one exists as “LM hacking.” The paper then concludes that methods such as DSL and CDI are not “enough” to address LM hacking. This conflates error type with scientific risk. DSL and CDI explicitly treat LM labels as noisy measurements: they use a small human “gold” set to estimate measurement error and then debias or propagate that uncertainty. By design, they trade some power (more Type II errors) to reduce false discoveries (Type I errors)—the same logic behind strict multiple-comparison controls (e.g., Bonferroni increases Type II error while lowering the family-wise error rate).
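To see that trade-off concretely, here is a toy simulation of my own construction (not taken from any of the papers above): two groups have exactly the same true prevalence, but the LM over-labels one of them. Testing for a group difference directly on the LM labels produces a false positive almost every time; applying the same design-based correction as in the sketch above, with a small gold subset per group, brings the false positive rate back near the nominal 5%, with wider confidence intervals as the price.

```python
import numpy as np

rng = np.random.default_rng(1)

def one_replication(n=5_000, n_gold=200, z_crit=1.96):
    """Two groups with the SAME true prevalence (the null is true), but the
    LM flips extra labels to 1 in group B only."""
    samples = []
    for flip_rate in (0.00, 0.15):           # group A: clean LM, group B: biased LM
        truth = rng.binomial(1, 0.30, size=n)
        lm = np.where(rng.random(n) < flip_rate, 1, truth).astype(float)
        gold_idx = rng.choice(n, size=n_gold, replace=False)
        pi = n_gold / n
        pseudo = lm.copy()
        # Same design-based correction as above, applied within each group.
        pseudo[gold_idx] += (truth[gold_idx] - lm[gold_idx]) / pi
        samples.append((lm, pseudo))

    def rejects(x, y):                        # two-sample z-test at the 5% level
        se = np.sqrt(x.var(ddof=1) / len(x) + y.var(ddof=1) / len(y))
        return abs(x.mean() - y.mean()) / se > z_crit

    (lm_a, pseudo_a), (lm_b, pseudo_b) = samples
    return rejects(lm_a, lm_b), rejects(pseudo_a, pseudo_b)

results = np.array([one_replication() for _ in range(500)])
print("false positive rate, naive LM labels:    ", results[:, 0].mean())  # ~1.0
print("false positive rate, corrected estimates:", results[:, 1].mean())  # near 0.05
```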
So, really, the meta-science question here is about efficiency, not purity. In the social sciences (and perhaps in general), we are currently constrained to explore only a very small subset of the total space of interesting questions worth answering (see this banger; or this one). So let's ask: how can we produce scientific knowledge more efficiently?
Suppose a research group has an annotation budget S and a set of K questions they want answered. At the extremes, they can approach this in two ways:
They can spend S/K dollars annotating data for each question they are interested in, and use DSL or an equivalent method to generalize annotations to the whole sample.
Or they can rank the K questions by interest and spend the entire budget S on the top one (or top few), annotating a sample large enough to run the analysis on human labels alone.
Of course, an optimal strategy is probably something in between. But under which broad regime can science progress more efficiently? If the calibration methods are good enough, probably something closer to the first extreme. And there’s nothing uniquely “LM-ish” about this pipeline—the same logic would have applied in the BERT era, though the subset we needed to hand-label might have been larger then.
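To make the comparison concrete, here is a crude back-of-envelope sketch. The budget, costs, and variance approximations are my own assumptions, not numbers from any of the papers above: under the first regime, the standard error of a corrected estimate is driven mostly by LM-human disagreement on each question’s small gold set; under the second, it is the usual human-only standard error.

```python
# Toy back-of-envelope; all numbers and approximations below are assumptions.
import math

S, c, K = 10_000, 1.0, 20     # budget in dollars, cost per human label, number of questions
var_y = 0.25                   # variance of a roughly balanced binary outcome
mse_lm = 0.10                  # LM disagrees with human coders on ~10% of documents

# Regime 1: split the budget, hand-label S/(K*c) documents per question,
# LM-label the rest, and correct. The corrected estimate's standard error is
# roughly sqrt(mse_lm / n_gold) when the corpus is much larger than the gold set.
n_gold = S / (K * c)
se_split = math.sqrt(mse_lm / n_gold)

# Regime 2: spend everything hand-labeling for the single top question.
n_human = S / c
se_single = math.sqrt(var_y / n_human)

print(f"Regime 1: {K} questions answered, SE ~ {se_split:.3f} each")   # 20 answers, ~0.014
print(f"Regime 2:  1 question answered, SE ~ {se_single:.3f}")         # 1 answer,  ~0.005
```

With these made-up numbers, splitting the budget answers twenty questions with standard errors only about three times larger than the single fully hand-labeled study, which is the sense in which calibration-plus-LM labeling could make the whole enterprise more efficient.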
Second, we should clarify the threat model of malicious actors who may engage in bad science.
It’s worth remembering that we’ve been here before. The current anxiety around LM use echoes the credibility revolution in psychology and social science during the 2010s. Back then, the problem wasn’t that everyone was faking data (though a few were) but that researchers had too many degrees of freedom: selective reporting, flexible stopping rules, and post-hoc hypothesizing made false positives almost inevitable.
However, when it comes to curbing truly malicious use of LMs, I wonder whether LMs create “new degrees of freedom” for cheating that actually matter. After all, exclusions and minor tweaks to the data that change the outcome were always possible. Maybe the threat of “intentional” LM hacking is that LMs make this kind of manipulation truly impossible to track (without access to compute logs), but I wonder whether the delta here is substantial enough to matter (frankly, I suspect most people could quietly change their data and get away with it, as long as they don’t do it in Excel).
Perhaps the threat is of a second-order nature: changing the way we conduct science, making us lazier and more dishonest when delegating tasks to LMs, or ushering in a phase of scientific inquiry in which “we produce more and understand less.” But still, it seems to me that research addressing these threat models would be most convincing if it documented such misuse rather than hypothesizing it.
Third, I think we need to embrace the fact that human labels can also suck. And that failing to reproduce studies does not necessarily tell you that LMs are “wrong.”
It is worth noting that in Yang et al., most of the labels under study are a mix of human annotations and earlier ML methods (e.g., BERT). They find that interannotator agreement between strong LMs is around 0.6. But we’re not even sure what human interannotator agreement is for these tasks to begin with. The authors cite another report suggesting that the n=20 studies that do report it average 0.73, but that figure smells of selection bias.
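For readers who haven’t computed these numbers themselves, “agreement” here usually means a chance-corrected statistic such as Cohen’s kappa (two coders) or Krippendorff’s alpha (more than two). A minimal example with made-up labels:

```python
# Chance-corrected agreement between two annotators (hypothetical labels).
from sklearn.metrics import cohen_kappa_score

human = [1, 0, 1, 1, 0, 0, 1, 0, 1, 1, 0, 1]
llm   = [1, 0, 1, 0, 0, 1, 1, 0, 1, 1, 0, 0]

# They agree on 9/12 items (0.75 observed), but chance agreement is 0.50 here,
# so kappa = (0.75 - 0.50) / (1 - 0.50) = 0.5.
print(cohen_kappa_score(human, llm))
```

Whether 0.6 between LMs is alarming therefore depends on the comparable human-human figure for the same task, which is exactly the number we often don’t have.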
This ambiguity points to a deeper problem: when LM and human annotations diverge, we have no reliable way to know which side is wrong—or whether “wrong” even applies. Disagreement doesn’t reveal error so much as the instability of the categories themselves. In many social science tasks, ground truth is not discovered but negotiated through convention, as human annotation is too costly to do over and over again (just to measure variability). LMs are cheap, however, and may thus expose, rather than cause, the fuzziness of our constructs. Systematic gaps between human and model judgments can reflect algorithmic bias, but just as easily fatigue, inconsistency, or ideological drift among human coders. The uncomfortable reality is that for many annotation tasks, truth has always been approximate. Replication failures with LMs don’t necessarily prove the models are broken; they remind us that our human baselines were never as solid as we liked to think.
My very centrist take is that labeling with LMs is neither trick nor treat. The real challenge is not replacement but calibration—whether we can model and report uncertainty with the same rigor we expect from any other instrument. LMs haven’t made annotation noisier; they’ve made its noise impossible to ignore. By exposing the randomness, inconsistency, and subjectivity that have long been hidden in human labeling, they force social science to confront its own measurement fragility. If we take that challenge seriously, LM labeling could make the field not lazier, but more transparent and scientific about its own uncertainty.
Acknowledgements: I thank Kristina Gligorić, Paul Röttger, and Joachim Baumann for feedback on this blog post.

