The Quiet Failure Of LLM Scientific Benchmarks, And Three New Ones That Fix It

Every benchmark published before this month for scientific LLMs was measuring recall. None were measuring judgement.

For two years we have run models through MMLU, GSM8K, ARC, and every variant of fact recall and stepwise reasoning test we can invent. We have ranked models, published leaderboards, and built entire autonomous research agent stacks on top of those scores. All the while, no one had properly tested the single capability that actually matters for any scientific use case: can this model tell a good idea from a bad one?

Three papers dropped on arXiv this week that change this. All three avoid the standard benchmark mistakes. All three return results that are deeply inconvenient for every vendor selling AI scientist products. None of the results have made it to public discussion yet. This is what they found.

We have been measuring the wrong thing

All existing scientific benchmarks share one unstated, invalid assumption: the test problem is well formed, all required information is provided, and there exists exactly one correct answer.

That is never the situation in actual science. In real research, information is incomplete, evidence conflicts, most ideas are bad, and most paths lead nowhere. Benchmarks that do not replicate this environment do not measure anything relevant to deployment. They measure how well a model can perform on artificial school tests, not how it will perform when given real work.

All three new benchmarks reject this model. They are built for messy, ambiguous, realistic conditions.

SoundnessBench: The optimism bias no one tested for

SoundnessBench is built from 1,099 full ICLR 2024 and 2025 research proposals. Not accepted papers. Submissions. Every one was scored by at least three program committee reviewers on explicit methodological soundness. The authors did not use final accept/reject outcomes. They used the raw, anonymized reviewer subscores for proposal soundness, before rebuttal, before committee politics. This is the single best labeled dataset of good and bad research ideas that has ever been released publicly.

The benchmark task is simple. Give the model the full unmodified proposal text. Ask it to rate the methodological soundness on the same 1-5 scale used by reviewers.

Across 12 frontier LLMs tested, every single model had a very large, very consistent bias. Every model rated bad proposals much higher than human reviewers did. False positive rates for low soundness proposals ran between 41% and 68% under standard prompting. No model had a false positive rate below 37% even after extensive prompt tuning.

Humans get this wrong too. Human reviewers have a 22% false positive rate on this exact same set. But under standard prompting, every LLM tested was roughly two to three times as likely to approve a bad research idea as a human reviewer (41% to 68%, against 22%).
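To make the error rates concrete, here is a minimal sketch of how a false positive rate and a false negative rate could be computed from paired ratings on the 1-5 scale. The cutoff of 3 for "sound" and the function name `error_rates` are assumptions for illustration; the paper's actual cutoff is not stated here.

```python
def error_rates(model_scores, reference_scores, threshold=3.0):
    """Return (false_positive_rate, false_negative_rate).

    Reference (human) scores are treated as ground truth. A proposal counts
    as "sound" when its score is >= threshold (an assumed cutoff).
    """
    if len(model_scores) != len(reference_scores):
        raise ValueError("score lists must have the same length")
    false_pos = false_neg = unsound = sound = 0
    for model, ref in zip(model_scores, reference_scores):
        model_sound = model >= threshold
        if ref >= threshold:
            sound += 1
            false_neg += not model_sound   # sound proposal rejected by the model
        else:
            unsound += 1
            false_pos += model_sound       # unsound proposal approved by the model
    fpr = false_pos / unsound if unsound else 0.0
    fnr = false_neg / sound if sound else 0.0
    return fpr, fnr


# Toy ratings, invented purely to exercise the function.
reference = [1, 2, 2, 4, 5, 3]
model = [4, 2, 3, 4, 5, 2]
fpr, fnr = error_rates(model, reference)
print(f"false positive rate: {fpr:.2f}, false negative rate: {fnr:.2f}")
```

The false positive rate is the one that matters for the optimism bias: it is the share of proposals humans rated unsound that the model waved through.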

What the ICLR submission data actually shows

You can fix bias, right? Just tell the model to be strict. Tell it to look for flaws. Tell it to act like a grumpy ICLR reviewer.

This works, but not in the way you want. Aggressive critical prompting sharply reduces false positives. It also triples the false negative rate. Models start rejecting nearly every proposal. No prompt setting was found that brings both error rates down to human levels. There is not even a setting that brings both within a factor of two.

The authors ran a full grid search across 17 prompt variants, temperature settings, and chain-of-thought strategies. No combination closed the gap. The total error does not shrink. It just moves from one side to the other.
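The shape of that search can be sketched in a few lines. Everything below is a hypothetical stand-in: the setting names, the `fake_evaluate` placeholder, and its numbers are invented, and the human false negative rate of 0.20 is an assumption (only the 22% human false positive rate comes from the paper as described above).

```python
from itertools import product


def build_grid(prompts, temperatures, cot_modes):
    """Enumerate every combination of prompt, temperature and CoT setting."""
    return list(product(prompts, temperatures, cot_modes))


def within_factor(results, human_fpr, human_fnr, factor=2.0):
    """Keep only settings where BOTH error rates are within `factor` of human."""
    return [
        config
        for config, (fpr, fnr) in results.items()
        if fpr <= factor * human_fpr and fnr <= factor * human_fnr
    ]


def fake_evaluate(config):
    """Placeholder for a real benchmark run; returns made-up (fpr, fnr)."""
    prompt, _temperature, _use_cot = config
    return (0.05, 0.60) if prompt == "strict" else (0.50, 0.15)


grid = build_grid(["neutral", "strict"], [0.0, 0.7], [False, True])
results = {config: fake_evaluate(config) for config in grid}

# human_fnr=0.20 is an assumed value, not a figure from the paper.
acceptable = within_factor(results, human_fpr=0.22, human_fnr=0.20)
print(len(grid), "settings searched,", len(acceptable), "acceptable")
```

The point of filtering on both rates at once is that a setting cannot pass by trading one kind of error for the other.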

This is not a prompting problem. This is a capability limit. Current LLMs do not have an internal model of what makes an experiment valid. They have a model of what a good research proposal sounds like. They cannot distinguish between correct methodology and convincing prose.

Contamination controls that every benchmark should copy

The most important part of SoundnessBench is not the results. It is the controls.

The authors ran four separate contamination tests. They removed all identifying metadata. They paraphrased every proposal. They shuffled section order. They removed all citations. None of these changed model scores by more than 3%.
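Two of these perturbations are mechanical enough to sketch. The following is an illustration under assumptions, not the authors' pipeline: the citation regex only covers numeric brackets and simple author-year parentheses, and reading "3%" as a relative change in score is a guess.

```python
import random
import re

# Matches "[3]" or "[3, 4]" and simple "(Smith et al., 2020)" style citations.
CITATION = re.compile(r"\[\d+(?:\s*,\s*\d+)*\]|\([A-Z][^()]*?\d{4}[a-z]?\)")


def strip_citations(text):
    """Remove inline citations, then collapse the doubled spaces left behind."""
    return re.sub(r"[ \t]{2,}", " ", CITATION.sub("", text))


def shuffle_sections(sections, seed=0):
    """Return the sections in a reproducible random order (input untouched)."""
    shuffled = list(sections)
    random.Random(seed).shuffle(shuffled)
    return shuffled


def max_relative_shift(baseline, perturbed):
    """Largest relative change in score between original and perturbed runs."""
    return max(abs(p - b) / b for b, p in zip(baseline, perturbed))


print(strip_citations("We follow prior work [3, 4] and (Smith et al., 2020) here."))
print(shuffle_sections(["Intro", "Method", "Experiments", "Limitations"]))
print(max_relative_shift([4.0, 2.0], [4.1, 2.0]) < 0.03)
```

A model that is reciting memorized reviews should be thrown off by these changes. A model whose scores barely move is responding to something else in the text.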

They also tested models on proposals that were submitted, but never posted anywhere public. Scores were identical. This is not memorization. This is actual inherent behaviour of the models.

No major LLM benchmark published in the last two years has run this set of controls. Every leaderboard you have seen should be considered untrusted until they do.

MedCase-Structured: The deployment penalty

The second paper addresses an even quieter failure mode. Every clinical LLM benchmark runs on plain text case notes. No production clinical system uses plain text case notes. Every modern hospital runs on HL7 FHIR.

FHIR is not just a file format. It is structured, normalized, coded, distributed data. Lab results come as separate resources. Medication history is linked. Allergies live in a different bundle. There is no prose summary. No one writes up a nice paragraph description of the patient for the benefit of the LLM.
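For readers who have not seen FHIR, here is a minimal, hand-written sketch of what an R4 bundle looks like as JSON-style data. The patient, values and ids are invented for illustration, and real bundles carry far more fields; the resource types and field names follow the R4 specification as I understand it.

```python
from collections import defaultdict

# A tiny, invented FHIR R4 bundle: each fact is its own resource,
# linked back to the patient by reference rather than described in prose.
bundle = {
    "resourceType": "Bundle",
    "type": "collection",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1", "gender": "female"}},
        {"resource": {
            "resourceType": "Observation",
            "status": "final",
            "code": {"coding": [{"system": "http://loinc.org", "code": "718-7"}],
                     "text": "Hemoglobin"},
            "subject": {"reference": "Patient/p1"},
            "valueQuantity": {"value": 9.1, "unit": "g/dL"},
        }},
        {"resource": {
            "resourceType": "MedicationRequest",
            "status": "active",
            "intent": "order",
            "medicationCodeableConcept": {"text": "Ferrous sulfate"},
            "subject": {"reference": "Patient/p1"},
        }},
    ],
}


def group_by_type(bundle):
    """Group a bundle's resources by their resourceType."""
    grouped = defaultdict(list)
    for entry in bundle.get("entry", []):
        resource = entry["resource"]
        grouped[resource["resourceType"]].append(resource)
    return dict(grouped)


for kind, resources in group_by_type(bundle).items():
    print(kind, len(resources))
```

Nothing in this structure says "anemic patient on iron supplements". The model has to assemble that picture from linked, coded fragments.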

The authors built a validated pipeline to convert existing clinical case benchmarks into standards-compliant FHIR R4 bundles. They produced 1,240 cases, with 82.5% passing full FHIR schema and terminology validation.

Then they ran the exact same LLMs that score 85%+ accuracy on the plain text version of this benchmark. On FHIR input, every model dropped between 19 and 31 percentage points in diagnostic accuracy.

GPT-5 scored 87% on plain text. It scored 58% on structured FHIR.

No vendor has ever reported this number. Every clinical LLM demo you have seen uses plain text inputs. None are tested against the actual data format they will receive in production.

FHIR is not just another input format

This is not a parsing bug. The models correctly parse the FHIR JSON. They correctly extract all individual fields. They just cannot reason over them when they are presented in the structured format used by real systems.

The authors ran control tests where they took the exact same data, converted it back to plain text prose, and passed it to the model. Accuracy returned immediately to baseline. Exactly the same information, exactly the same model, only the representation changed.
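The round trip back to prose can be sketched roughly as follows. This is a hypothetical converter covering two resource types, not the authors' control pipeline; the sentence templates and the sample bundle are invented.

```python
def describe(resource):
    """Render one FHIR resource as a plain-English sentence, or None if unsupported."""
    kind = resource.get("resourceType")
    if kind == "Observation":
        name = resource["code"].get("text", "unnamed observation")
        quantity = resource.get("valueQuantity", {})
        return f"{name} was {quantity.get('value')} {quantity.get('unit', '')}".strip() + "."
    if kind == "AllergyIntolerance":
        substance = resource["code"].get("text", "an unspecified substance")
        return f"The patient has a recorded allergy to {substance}."
    return None


def bundle_to_prose(bundle):
    """Flatten a bundle into a prose paragraph, skipping unsupported resources."""
    sentences = [describe(entry["resource"]) for entry in bundle.get("entry", [])]
    return " ".join(sentence for sentence in sentences if sentence)


# Invented sample data for illustration only.
sample = {
    "resourceType": "Bundle",
    "type": "collection",
    "entry": [
        {"resource": {"resourceType": "Patient", "id": "p1"}},
        {"resource": {"resourceType": "Observation", "status": "final",
                      "code": {"text": "Hemoglobin"},
                      "valueQuantity": {"value": 9.1, "unit": "g/dL"}}},
        {"resource": {"resourceType": "AllergyIntolerance",
                      "patient": {"reference": "Patient/p1"},
                      "code": {"text": "Penicillin"}}},
    ],
}

print(bundle_to_prose(sample))
```

The information content of `sample` and of the printed paragraph is the same. Only the representation differs, which is the variable the control isolates.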

This is a fundamental property of how LLMs reason. They were trained on prose. They reason much better over prose. Any benchmark that does not use the exact input representation that will be used in deployment is measuring nothing.

ProjectionBench: Reasoning under partial information

The third benchmark tests the thing everyone actually wants from an AI scientist: can it come up with the right hypothesis before you run the experiment?

ProjectionBench works differently from every other benchmark. There is no full context. The model gets only the research question. It is asked to generate testable hypotheses. Then, piece by piece, additional experimental details are disclosed. At each step the model is allowed to update its hypothesis.
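The disclosure protocol amounts to a simple loop. The sketch below assumes a hypothetical `propose` callable standing in for the model call; the names and the stub are illustrative and not taken from the paper.

```python
def run_disclosure(question, details, propose):
    """Ask for a hypothesis with no context, then again after each disclosed detail.

    `propose(question, context)` is a hypothetical stand-in for a model call.
    Returns the list of hypotheses, one per disclosure step.
    """
    context = []
    history = [propose(question, list(context))]   # step 0: question only
    for detail in details:
        context.append(detail)                     # reveal one more piece
        history.append(propose(question, list(context)))
    return history


def stub_model(question, context):
    """Trivial stub so the loop can run without a real model."""
    return f"hypothesis given {len(context)} disclosed details"


steps = run_disclosure(
    "Does dopant X raise the transition temperature?",
    ["synthesis route", "measurement setup", "sample composition"],
    stub_model,
)
for step in steps:
    print(step)
```

Scoring each entry in `steps` separately is what lets the benchmark report performance as a function of how much context the model has seen.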

Hypotheses are scored not on exact wording match, but on alignment of atomic factual claims with the eventual published result.
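A claim-level F1 of this kind can be sketched as below. One loud assumption: claims here are matched by exact string equality, which is a simplification. How the benchmark actually decides that two atomic claims align is not described above and is presumably more lenient.

```python
def claim_f1(predicted, reference):
    """F1 between two collections of atomic claims, using exact-match overlap."""
    predicted, reference = set(predicted), set(reference)
    matched = len(predicted & reference)
    if matched == 0:
        return 0.0
    precision = matched / len(predicted)   # share of the model's claims that hold
    recall = matched / len(reference)      # share of the published claims recovered
    return 2 * precision * recall / (precision + recall)


# Invented claims, for illustration only.
hypothesis_claims = ["doping raises Tc", "effect saturates at 5%", "lattice contracts"]
published_claims = ["effect saturates at 5%", "lattice contracts",
                    "phase is metastable", "grain size is unchanged"]

print(round(claim_f1(hypothesis_claims, published_claims), 3))
```

Scoring claims rather than wording means a hypothesis gets credit for being partly right and is penalized for padding itself with claims the result does not support.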

This is how actual science works. You start with a question. You get a little bit of data. You update. You never have all the information up front.

The alignment dropoff no one was measuring

Across 45 recent papers spanning three materials science domains, the results are clear. All frontier models perform very well when given full experimental context. Performance falls off a cliff as you remove information.

Gemini 3.1 Pro maintains 0.78 F1 alignment with full context. At 25% context disclosure it drops to 0.31.

GPT-5.4 is the only model that holds up. It maintains 0.70 F1 even when given nothing but the original research question. That is an impressive result. It is also still worse than the average postdoc working in the field, who scores approximately 0.78 on the same task.

Importantly, there was almost no correlation between performance on this benchmark and performance on any standard general purpose LLM benchmark. You cannot predict how well a model will generate hypotheses from its MMLU score.
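One way to check a claim like this on your own models is a rank correlation between the two sets of scores. The sketch below implements Spearman's formula for the no-ties case; the scores are made up for illustration and are not results from any of the papers.

```python
def ranks(values):
    """Rank values from highest (1) to lowest (n). Assumes no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i], reverse=True)
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Spearman rank correlation, valid when neither list contains ties."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equal-length lists with at least 2 items")
    d_squared = sum((a - b) ** 2 for a, b in zip(ranks(xs), ranks(ys)))
    return 1 - 6 * d_squared / (n * (n * n - 1))


# Made-up scores for four hypothetical models.
general_benchmark = [88, 86, 85, 83]
hypothesis_f1 = [0.31, 0.70, 0.45, 0.52]

print(spearman(general_benchmark, hypothesis_f1))
```

A value near 1 would mean the general leaderboard predicts the task ranking. A value near zero, or a negative one, means it does not.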

What all three benchmarks agree on

All three papers arrived independently. None of the authors knew each other. All three arrived at almost exactly the same set of conclusions.

First, general purpose LLM leaderboard scores are almost completely uninformative for scientific use cases. Models that are separated by 5 points on MMLU can differ by 30 points on these tasks. Rank order reverses completely.

Second, there are no quick fixes. Prompting changes the error distribution. It does not reduce total error. Fine-tuning has not been tested, but all authors note that there is no existing evidence it will close these gaps.

Third, every single claim made over the last 12 months about autonomous AI research agents was based on evaluation that did not test for these failure modes. All existing agent demonstrations are heavily selected for success cases. None have been tested against the failure modes measured here.

Stop building agents before you fix evaluation

We have this backwards right now. We are building increasingly complex agent architectures, with planning loops, memory systems, tool use and peer review stages, on top of base models that cannot even reliably tell a good idea from a bad one.

You cannot build a reliable system on top of a component that approves between 41% and 68% of bad research proposals. You can build impressive demos. You can raise funding. You cannot build something that will actually accelerate science.

These three benchmarks are not the final answer. They are the first benchmarks that are even asking the right questions.

What comes next

Over the next six months every team building scientific AI will run these tests. Most will not like the results. Many will try to game the benchmarks. That is normal.

The important shift is that we are no longer measuring what models know. We have started measuring what they can judge. That is the line between a search engine and a scientist.

We have not crossed that line yet. We now at least have a way to measure how far away we are.
