AI Is Already Giving Medical Conclusions. Are They Any Good?

Jun 12, 2026

Today I am sharing a guest post by my student Hayoung Jung. He just started his own Substack, Systems Under Test, which you may want to follow if you are interested in evaluation and measurement science.

Hayoung is a Ph.D. student in computer science at Princeton University, co-advised by Manoel Horta Ribeiro and Aleksandra Korolova. His research broadly focuses on advancing inclusive AI technologies and online platforms to better serve society and communities often overlooked in system development. Drawing on an interdisciplinary background, Hayoung develops technical frameworks and methods grounded in social science theories, with two main goals: auditing AI systems and online platforms and studying social phenomena, such as community norms, through language and online behavior. He completed his undergraduate degrees in computer science and political science, and his M.S. in computer science, at the University of Washington.

Recently, I was talking with some family members from South Korea who mentioned their back pain. My immediate question: “What did the doctor say?” Healthcare is highly accessible and affordable in South Korea, so I assumed they had already seen one.

Nope. They asked ChatGPT.

In all honesty, this was not truly surprising given how useful these models are. But the moment captures a growing social phenomenon happening everywhere. AI systems are becoming the first stop for health and scientific questions, even in countries where professional care is available and accessible.

And people are not just asking these systems to retrieve webpages or list sources, as they might with traditional search engines. Agentic systems, such as Google AI Overview, OpenEvidence, and OpenAI Deep Research, synthesize information from multiple sources and present immediate conclusions to users’ questions in real time. Increasingly, users are asking directly: “What is my diagnosis?” “What are the best treatment options?” “What should I do next?”

Reports suggest this is happening across audiences. Laypeople ask AI systems about symptoms, treatments, and scientific claims, while more than 80% of U.S. physicians use them in their professional workflows, including to explore medical questions and support decision-making. When AI systems are becoming the first (or even the only) stop for health and scientific questions, are they even reliable at synthesizing scientific evidence into conclusions that people may actually act on?

A Benchmark for Scientific Synthesis

To answer this, I worked with my amazing PhD advisors Manoel Horta Ribeiro and Aleksandra Korolova (who also have their own Substacks here and here) to create a benchmark for evaluating how well current AI agents synthesize scientific conclusions from the open web.

Scientific conclusion synthesis requires several steps. An agent must retrieve relevant evidence from the open web, filter out irrelevant or low-quality sources, reason across multiple studies, weigh conflicting findings, preserve uncertainty, and synthesize a long-form conclusion. Importantly, these kinds of tasks are long-horizon and open-ended, as expert scientists often spend months searching the literature on the open web, evaluating studies, and synthesizing careful conclusions about what the evidence in the field actually supports.

To evaluate this, we built SciConBench, a large-scale benchmark of 9.11K scientific questions paired with expert-written conclusions from Cochrane systematic reviews, a gold standard in evidence-based medicine. Each SciConBench task asks an AI agent to use web tools to answer a scientific question with a paragraph-length conclusion, which we compare against the corresponding expert-written Cochrane conclusion. Importantly, SciConBench is a live benchmark: it is continuously updated as new Cochrane reviews are published, enabling timely evaluations and reducing benchmark leakage as new models are trained on recent web data.

Figure: Overview of SciConBench. We evaluate whether AI agents can use tools to synthesize scientific conclusions from the open web, without simply retrieving the expert-written answer online. We compare AI-generated conclusions against expert-written Cochrane conclusions by measuring how factually accurate and complete they are. Even under this controlled setup, frontier AI agents struggle to synthesize reliable scientific conclusions.

The Leakage Problem

While running SciConBench, we noticed something surprising in our agent logs: AI agents were explicitly searching for the benchmark answers in Cochrane review articles, even though the system prompt instructed them not to. Anthropic recently published a neat blog post on this phenomenon, called “evaluation awareness,” in which models recognize that they are being evaluated and explicitly look for the answers online.

As models become increasingly capable, a major challenge in evaluating web-enabled agents is that they can often find the answer directly. If a benchmark question comes from a published systematic review, an agent with web access may simply retrieve the review itself, or another webpage that covers its conclusion (e.g., news coverage). At that point, the task is no longer about synthesizing the scientific evidence from scratch, but rather merely retrieving the ground-truth answer (a much easier task!). The model may look impressive, but we would not be measuring the capability we actually care about.

To address this, we built SciConHarness, a clean-room evaluation harness. This evaluation harness enforces the clean-room protocol, ensuring agents have controlled access to web search, browsing, and paper search tools, while filtering out ground-truth artifacts such as Cochrane pages and review articles that could leak the answer. This lets us evaluate whether the agent can synthesize the conclusion from the open-web evidence, rather than shortcutting to the already-written expert answer.
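To make the filtering idea concrete, here is a minimal sketch of how a clean-room filter over search results could look. It is not the SciConHarness implementation: the blocklist, the `is_blocked` and `filter_results` helpers, and the result format (dicts with `url` and `title`) are all assumptions made for illustration.

```python
from urllib.parse import urlparse

# Illustrative blocklist only. The post says Cochrane pages and review
# articles are filtered; the exact rules used by SciConHarness are not shown.
BLOCKED_DOMAINS = {"cochrane.org", "cochranelibrary.com"}


def is_blocked(url, blocked_domains=BLOCKED_DOMAINS):
    """Return True if the URL's host is a blocked domain or one of its subdomains."""
    host = (urlparse(url).hostname or "").lower()
    return any(host == d or host.endswith("." + d) for d in blocked_domains)


def filter_results(results, blocked_domains=BLOCKED_DOMAINS, blocked_titles=()):
    """Drop results that point at a blocked domain or mention a blocked review title.

    `results` is assumed to be a list of dicts with "url" and "title" keys.
    """
    kept = []
    for result in results:
        title = result.get("title", "").lower()
        if is_blocked(result["url"], blocked_domains):
            continue  # the ground-truth source itself
        if any(t.lower() in title for t in blocked_titles):
            continue  # secondary coverage that names the review
        kept.append(result)
    return kept
```

A real harness would need to go further than this, for example by also checking page contents and mirrored copies of a review, since a title match alone will miss paraphrased coverage.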

Measuring factual quality

In our study, we work with doctors to validate every component of our benchmark creation and evaluation pipeline. After an AI agent synthesizes a conclusion from the open web, we evaluate that conclusion using our expert-validated factual evaluation pipeline. Instead of judging the whole paragraph at once, we decompose both the AI-generated conclusion and the expert-written reference conclusion into a series of facts, i.e., statements that each contain a single piece of information. Then, we measure two things:

Factual precision (correctness): Are the facts in the AI-generated conclusion supported by the reference, or do they contradict it?
Factual recall (coverage): Does the AI-generated conclusion cover the key facts from the reference conclusion needed to answer the question?

We use these two metrics because a scientific conclusion can fail in different ways. A conclusion may contain incorrect claims – for example, by overstating weak evidence or flipping the direction of a treatment effect. Alternatively, it may be mostly true but incomplete, omitting key facts or caveats that matter for decision-making. To capture both correctness and completeness, we also report Factual F1, the harmonic mean of factual precision and factual recall. In other words, a system can only score highly on F1 if it performs well on both dimensions: it must avoid making unsupported or contradictory claims, while also covering the key facts needed to answer the question. All metrics range from 0 to 1, with higher being better.
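As a rough illustration of how the three metrics relate, the sketch below computes them from fact counts. The function name and the counting scheme are assumptions: it treats precision as the share of generated facts that the reference supports and recall as the share of reference facts that the generated conclusion covers, which may differ in detail from the paper's pipeline.

```python
def factual_scores(n_supported, n_generated, n_covered, n_reference):
    """Return (precision, recall, F1) from fact counts.

    n_supported: generated facts supported by the reference conclusion
    n_generated: all facts extracted from the generated conclusion
    n_covered:   reference facts covered by the generated conclusion
    n_reference: all facts extracted from the reference conclusion
    """
    precision = n_supported / n_generated if n_generated else 0.0
    recall = n_covered / n_reference if n_reference else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean: high only when both precision and recall are high.
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up counts for illustration: 6 of 8 generated facts are supported,
# and 3 of 6 reference facts are covered.
p, r, f1 = factual_scores(6, 8, 3, 6)
print(round(p, 3), round(r, 3), round(f1, 3))  # 0.75 0.5 0.6
```

Because the harmonic mean is pulled toward the lower of the two values, a conclusion that is perfectly precise but covers only a fifth of the reference facts would score an F1 of about 0.33, not 0.6.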

So how do these AI agents perform?

Figure: Our benchmark results. Each metric ranges from 0 to 1, with higher being better. We test frontier models and deep research (DR) agents using SciConHarness; the best factual F1 score under the clean room was 0.337. As the δ_Clean F1 values show, models and deep research agents consistently lose performance when the clean room is applied.

Let’s look at the benchmark results above. Across frontier models and deep research agents, synthesizing scientific conclusions remains far from solved. Under clean-room evaluation, which better isolates true synthesis capability, the best-performing agent (OpenAI’s o3-deep-research) achieved a factual F1 of only 0.337. In other words, even the strongest systems struggled to produce conclusions that were both correct and comprehensive with respect to the expert-written Cochrane reviews.

We also found that clean-room evaluation consistently reduced performance. When agents had unrestricted web access (i.e., no clean room), they performed better. However, when we filtered out ground-truth leakage with our clean room, their scores consistently dropped. This suggests that some of the apparent performance in open-web evaluations comes from retrieving benchmark artifacts, not from genuinely synthesizing conclusions from evidence.

This leakage issue is important beyond our benchmark. If we evaluate AI agents in environments where they can shortcut and find the answer directly, we may overestimate their real capabilities, especially for high-stakes tasks in health and science.

The deployed agents were also unreliable.

Figure: We audit consumer-facing agents, like Google AI Overview and OpenEvidence, using our benchmark. Given that these tools are used millions of times in real-world health decision-making, their errors could result in a substantial amount of incorrect advice reaching both clinicians and laypeople.

We also audited consumer-facing agents, including Google AI Overview, Google AI Mode, and OpenEvidence. These agents are already being used by laypeople and clinicians to synthesize health information. OpenEvidence, in particular, is marketed as a “clinical AI copilot for doctors” for “high-stakes decisions” and is used hundreds of millions of times in the medical context.

Looking more closely at the table above, even when these agents had access to the ground-truth review, their conclusions were often incomplete and sometimes contradictory. OpenEvidence performed best among the audited agents, but still covered only about half of the reference facts and produced contradictory claims: in fact, 50.8% of its generated conclusions contained at least one claim that contradicted the Cochrane review.

Google AI Overview and Google AI Mode performed worse, with lower coverage and similarly concerning contradiction rates: 56.3% and 59.0% of their conclusions, respectively, contained at least one contradiction. In many cases, the ground-truth answer was already available online, meaning the models should have been able to identify, retrieve, and prioritize such high-quality sources. This suggests that the failure likely occurred somewhere in the synthesis process, such as evaluating the quality of the evidence, integrating the high-quality sources, and communicating the evidence correctly.
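The contradiction rates quoted above are conclusion-level figures: a conclusion counts once if any of its claims contradicts the reference. A minimal sketch of that aggregation follows, assuming each generated conclusion has already been broken into claims labeled "supported", "unsupported", or "contradicted". The label names and the `contradiction_rate` helper are hypothetical, not the paper's code.

```python
def contradiction_rate(labels_per_conclusion):
    """Share of conclusions that contain at least one contradicted claim.

    `labels_per_conclusion` is a list with one entry per generated conclusion;
    each entry is a list of per-claim labels for that conclusion.
    """
    if not labels_per_conclusion:
        return 0.0
    flagged = sum(
        1 for labels in labels_per_conclusion if "contradicted" in labels
    )
    return flagged / len(labels_per_conclusion)


# Made-up labels for four conclusions; two contain a contradiction.
example = [
    ["supported", "contradicted"],
    ["supported"],
    ["unsupported", "supported"],
    ["contradicted", "contradicted"],
]
print(contradiction_rate(example))  # 0.5
```

Note that this measure is deliberately strict: a conclusion with ten correct claims and one contradicted claim is flagged just like one that is wrong throughout, which fits a setting where a single misleading statement can affect a decision.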

So what?

Scientific conclusions are compressed decision-making tools. The optimistic view of AI agents is that they will help democratize expertise by synthesizing these scientific conclusions at scale and in real time. A clinician could quickly get up to speed on an unfamiliar condition. A patient, including someone like my own family member with back pain, could determine whether a treatment seems promising. A scientist could accelerate a literature review and understand the frontiers of science. A policymaker could synthesize scientific conclusions before making a decision. The vision is compelling.

However, our results suggest that current systems are not yet reliable enough to synthesize scientific conclusions, especially in high-stakes settings like health, where even a single misleading answer can deeply impact stakeholders.

These agents can generate seemingly competent conclusions that omit key information, include unsupported claims, or contradict expert reviews, creating the risk of patients, clinicians, scientists, and policymakers relying on conclusions that do not faithfully reflect the underlying evidence.

Given that these tools are used hundreds of millions of times in health contexts, even modest error rates could translate into a substantial amount of misleading advice or unsafe answers in practice. Our findings suggest that these systems and their use in clinical settings deserve much greater public scrutiny.

While AI agents provide real utility in health and science, we need to be much more precise about what they can and cannot do. With SciConBench, we hope to push agentic evaluation closer to an important real-world task we expect these systems to perform: synthesizing careful scientific conclusions from the open web.

More broadly, we see this work as part of the measurement infrastructure needed for AI systems in high-stakes domains. If these systems are going to be used in medicine and science, we need stronger evaluations of the tasks people actually delegate to them, along with greater transparency from AI providers, including usage data and post-deployment monitoring. Without that transparency, it is difficult to know how often these errors happen in the real world, who is affected, and when they lead to harm.

For now, our results suggest that we should treat these systems less like expert reviewers and more like fallible assistants: useful in some contexts, but requiring careful expert oversight, independent verification, and much stronger evaluation before they are trusted in high-stakes decisions. AI may one day help democratize expertise. But until then, ask a doctor or a scientist before letting the chatbot make the call.

Interested in reading more? Check out our paper!

A guest post by

Hayoung Jung

Computer Science PhD Student at Princeton University. I study AI for health, along with evaluation and measurement science.

Adeola Abdulramon

Jun 30

This is a good read. And I commend the factual proofs, and method carried out in this research. As a software engineer working with coding agents everyday, I realized AI is not at that expert level yet. They required a close monitoring, human review, with lot of instructions. Delegation of tasks using auto-pilot mode is still a big risk.
