Nobody Actually Knows If LLMs Can Reason. These Four Papers Change That.

#llm-reasoning #benchmarking #chain-of-thought #fine-tuning #reward-models

For two years everyone has been measuring LLM reasoning wrong. Every leaderboard you see, every benchmark result posted to social media, every release blog post uses exactly the same wrong metric. They count whether the final answer is correct. That is it.

This week four papers landed on arXiv within 72 hours of each other, all pointing to exactly the same conclusion. We have been optimizing for the wrong thing, and as a result we have built models that are extremely good at looking like they reason, and extremely bad at actually doing it.

The dead end of reasoning benchmarks

Every major reasoning benchmark in use today was constructed the same way. Researchers collect standard textbook problems, format them consistently, remove ambiguous wording, strip out distracting information. Then they run models, count right answers, and publish rankings.

This methodology produces very clean numbers. It produces very repeatable numbers. It does not produce numbers that tell you anything about how a model will perform outside the benchmark.

We have known this for a while. We had anecdotes. We had Twitter threads of gotcha prompts. We did not have controlled, replicated, large-scale evidence. Now we do.

96% accuracy until you change one word

The first paper tested probabilistic reasoning across 8 state-of-the-art models, including GPT-4o, Claude 3 Opus and DeepSeek R1.

The authors built two matched datasets. The first contained standard textbook probability problems, exactly the type used on every existing benchmark. Across all models average accuracy was 0.96. Any reasonable observer would conclude these models have fully mastered basic discrete probability.

The second dataset contained mathematically identical problems, rephrased to avoid canonical wording, constructed to trigger common human heuristic biases. Average accuracy dropped to 0.59.

That is not a small edge-case degradation. That is a 37 point fall from near perfect performance, without changing a single underlying fact about the problem.

Performance dropped 22% when the authors just replaced standard dice wording with equivalent marbles wording. Performance dropped 34% when they added one irrelevant true sentence to the problem description. No model was immune. Not one.
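The matched-dataset comparison is easy to reproduce on your own problems. Here is a minimal sketch, assuming you already have a correct/incorrect flag for each problem in both its textbook and its reworded form. The `PairedResult` record and `robustness_report` helper are illustrative names, not anything from the paper. It reports the drop both in absolute points and in relative percent, because the two are easy to confuse.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairedResult:
    """One problem posed in two mathematically identical phrasings (hypothetical record)."""
    problem_id: str
    canonical_correct: bool   # textbook wording
    perturbed_correct: bool   # reworded or distractor-laden wording


def accuracy(flags):
    """Fraction of True values; 0.0 for an empty input."""
    flags = list(flags)
    return sum(flags) / len(flags) if flags else 0.0


def robustness_report(results):
    """Compare accuracy on canonical vs. perturbed phrasings of the same problems."""
    results = list(results)
    canonical = accuracy(r.canonical_correct for r in results)
    perturbed = accuracy(r.perturbed_correct for r in results)
    # Problems solved in textbook form but lost after rewording.
    broken = [r.problem_id for r in results
              if r.canonical_correct and not r.perturbed_correct]
    relative = (canonical - perturbed) / canonical * 100 if canonical else 0.0
    return {
        "canonical_acc": canonical,
        "perturbed_acc": perturbed,
        "drop_points": round((canonical - perturbed) * 100, 1),
        "relative_drop_pct": round(relative, 1),
        "broken_ids": broken,
    }
```

Applied to the headline averages, 0.96 to 0.59 is a drop of 37 points, or roughly 38.5% in relative terms. The `broken_ids` list is the useful part in practice: it tells you which specific rewordings to go read.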

Real world problems never look like textbook exercises. They have distracting details. They use unusual wording. They are not the exact problem you saw ten thousand times in the training set. On this class of problem, every model we currently have breaks.

Topological mimicry: the fake reasoning we built

The second paper is the most important work published on LLM reasoning to date. The authors performed a full anatomical comparison between DeepSeek R1 and human competitors across every problem from the 2025 AIME mathematics competition.

They collected and annotated 10,247 full reasoning traces, tagging every individual step into one of five functional categories: Analysis, Inference, Branch, Backtrace, Reflection.

Human solutions have a very clean structure. 87% of human steps alternate strictly between analysis and inference. Humans backtrack on average 1.2 times per problem. When humans reflect, they do it exactly once, immediately before committing to a final answer.

DeepSeek R1 does not do this. Even on traces that produced the correct final answer, the model performed on average 7.8 verification steps that did not advance the logical argument. 41% of all steps were local repeated checks that restated the same intermediate result without modification. The model loops. It spins its wheels. It produces paragraphs of text that contain no logical errors, that look exactly like reasoning, that do not move the problem forward at all.
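You can get a rough read on this kind of looping without any model in the loop. The sketch below assumes each step has been annotated with the intermediate result it states, which is a hypothetical annotation format, not the paper's. It counts a step as spinning when it restates the previous step's result unchanged. That is a crude proxy for the "local repeated checks" described above, not the authors' tagging procedure.

```python
def normalize(text):
    """Lowercase and collapse whitespace so trivial rewrites compare equal."""
    return " ".join(text.lower().split())


def spinning_fraction(results):
    """Fraction of steps that restate the previous step's intermediate result.

    `results` is the intermediate result stated by each step, in order.
    """
    results = list(results)
    if not results:
        return 0.0
    repeats = sum(
        1
        for prev, cur in zip(results, results[1:])
        if normalize(cur) == normalize(prev)
    )
    return repeats / len(results)
```

A trace such as `["x = 4", "x = 4", "X =  4", "area = 16", "area = 16"]` has three restatements out of five steps. String equality will miss paraphrased repeats, so treat the number as a lower bound.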

The authors call this topological mimicry. The model has learned the shape of good reasoning. It has learned that long traces with reflection language get high reward. It has not learned that reflection is supposed to fix mistakes.

This is not a bug. This is exactly what we trained it to do. Every process reward model currently in production rewards longer traces. Every reward model gives points for writing "let me check that again". None of them check if checking actually accomplished anything.

The two signals that actually indicate real reasoning

All is not lost. The same study found two signals that almost perfectly separate genuine reasoning from mimicry, across every model tested.

First: Successful traces use branching and backtracking within a very narrow stable band. Failed traces either never backtrack at all, or backtrack constantly. There is almost no overlap between the distributions.

Second: Reflection only improves outcomes when it appears immediately after an inference step. Reflection that appears during the analysis phase is always useless. It will always be just spinning.

You can score any reasoning trace right now with these two rules, and you will predict final correctness better than any existing process reward model. You can do this without ever reading the actual numbers or logic written in the trace.
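The two rules translate into a few lines of code once steps are tagged with the five categories. This is a sketch under stated assumptions. The post gives no numeric band for the first rule, so `EXPLORE_BAND` is a made-up placeholder you would need to fit on your own traces. Averaging the two checks into one score is also my choice, not the paper's.

```python
from enum import Enum


class Step(Enum):
    ANALYSIS = "Analysis"
    INFERENCE = "Inference"
    BRANCH = "Branch"
    BACKTRACE = "Backtrace"
    REFLECTION = "Reflection"


# Hypothetical band for the share of Branch/Backtrace steps; not from the paper.
EXPLORE_BAND = (0.05, 0.25)


def exploration_rate(trace):
    """Share of steps that are Branch or Backtrace."""
    trace = list(trace)
    if not trace:
        return 0.0
    explore = sum(1 for s in trace if s in (Step.BRANCH, Step.BACKTRACE))
    return explore / len(trace)


def reflection_placement(trace):
    """Return (well_placed, misplaced) counts of Reflection steps.

    A reflection is well placed only if the step right before it is Inference.
    """
    trace = list(trace)
    well = misplaced = 0
    for prev, cur in zip([None] + trace[:-1], trace):
        if cur is Step.REFLECTION:
            if prev is Step.INFERENCE:
                well += 1
            else:
                misplaced += 1
    return well, misplaced


def structural_score(trace, band=EXPLORE_BAND):
    """Score in [0, 1]: the average of the two rule checks."""
    trace = list(trace)
    low, high = band
    in_band = 1.0 if low <= exploration_rate(trace) <= high else 0.0
    well, misplaced = reflection_placement(trace)
    total = well + misplaced
    placement = well / total if total else 1.0  # no reflection means nothing misplaced
    return 0.5 * in_band + 0.5 * placement
```

Note that the scorer never looks at the content of a step, only its tag and position. That is the whole point of the claim above.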

Nobody noticed this before. Nobody was looking at the structure. Everyone was just counting right answers.

We have been rewarding the wrong prefixes

The third paper attacks the core assumption behind all modern process reward models.

All existing PRMs score each reasoning step in isolation. They ask: is this step mathematically correct?

This is the wrong question. The correct question is: given everything written up to this step, how much more likely is the model to eventually get the right answer?

The authors define this value as prefix gain. They built a reward model that does not check step correctness at all. It just measures empirically what fraction of the time a given prefix leads to a correct final answer.
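One way to estimate this empirically is plain Monte Carlo. In the sketch below, `rollout` is a hypothetical callable that completes a trace from a given prefix and returns whether the final answer was correct. In practice it would wrap your model and an answer checker. Treating the gain of a step as the change in success rate it causes is my reading of the description above, not a formula taken from the paper.

```python
def success_rate(prefix, rollout, n=16):
    """Fraction of n completions from `prefix` that end in a correct answer."""
    return sum(bool(rollout(prefix)) for _ in range(n)) / n


def prefix_gains(steps, rollout, n=16):
    """Per-step gain: success rate with the step minus success rate without it."""
    steps = list(steps)
    gains = []
    previous = success_rate([], rollout, n)
    for i in range(1, len(steps) + 1):
        current = success_rate(steps[:i], rollout, n)
        gains.append(current - previous)
        previous = current
    return gains


# Toy deterministic rollout for illustration: a trace succeeds only once
# the key step is in the prefix.
def toy_rollout(prefix):
    return "factor" in prefix
```

With the toy rollout, the steps `["restate", "factor", "recheck"]` get gains of 0, 1 and 0. The recheck step is not wrong, it just does not change the odds, which is exactly what step-level correctness scoring cannot see. With a real sampled model the estimates are noisy, so `n` matters.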

It works better. On best-of-16 selection it outperforms every existing PRM by 11%. On beam search it gains 18%. Most importantly, it almost completely eliminates spinning wheel traces. The model stops writing useless verification steps almost immediately, because those steps reliably predict that the trace will fail.
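For readers who have not used it, best-of-N selection is the simplest way to put any trace scorer to work: sample N candidates, score each one, keep the top one. A minimal sketch follows. The `score` callable stands in for whatever reward model you use, and the lookup table in the usage note is invented for illustration.

```python
from typing import Callable, Sequence, TypeVar

T = TypeVar("T")


def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the candidate with the highest score (first one wins on ties)."""
    if not candidates:
        raise ValueError("need at least one candidate")
    return max(candidates, key=score)
```

For example, with estimated success rates `{"trace_a": 0.2, "trace_b": 0.9, "trace_c": 0.5}` as the scorer, `best_of_n` picks `"trace_b"`. Swapping the scorer is the only change needed to compare a step-correctness PRM against an outcome-based one on the same samples.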

This is an embarrassing result. For three years the entire field has been building ever more complex step-level correctness reward models. It turns out you can beat all of them by just measuring what actually works.

Stop forcing models to copy your solutions

The fourth paper addresses supervised fine-tuning, the base step used for every reasoning model released today.

Standard SFT for reasoning works like this: you collect good human solutions. You train the model to output exactly those solutions.

Everyone does this. Everyone has done this since the first Chain-of-Thought papers. It is obviously wrong.

Reasoning is not a single path. There are usually ten different valid ways to solve any nontrivial problem. If you force the model to copy exactly one path, you are not teaching it to reason. You are teaching it to imitate your reasoning. Worse, you are actively suppressing any reasoning strategy the model might have discovered on its own.

The proposed method, Rollout-Adaptive SFT, fixes this. For every problem, it first runs the model 32 times. If the model already solves the problem reliably on its own, you turn down the expert loss weight. You let it use its own solutions. Only when the model cannot solve the problem at all do you show it the human demonstration.
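The weighting idea can be sketched in a few lines. Everything specific here is an assumption on my part: the linear schedule, the `floor` parameter, and mixing the expert loss with a loss on the model's own correct rollouts. The paper's actual loss may look quite different. The inputs are plain floats standing in for per-example negative log-likelihoods.

```python
def solve_rate(outcomes):
    """Fraction of rollouts that reached the correct answer."""
    outcomes = list(outcomes)
    return sum(outcomes) / len(outcomes) if outcomes else 0.0


def expert_weight(rate, floor=0.0):
    """Linear schedule (an assumption): weight 1.0 when no rollout succeeds,
    falling to `floor` when every rollout succeeds."""
    return floor + (1.0 - floor) * (1.0 - rate)


def mixed_loss(expert_nll, self_nll, rate, floor=0.0):
    """Blend the loss on the human demonstration with the loss on the model's
    own correct rollouts. `self_nll` is None when no rollout was correct."""
    if self_nll is None:
        # Nothing of the model's own to learn from: fall back to the demonstration.
        return expert_nll
    w = expert_weight(rate, floor)
    return w * expert_nll + (1.0 - w) * self_nll
```

If 24 of 32 rollouts succeed, the solve rate is 0.75 and the demonstration gets weight 0.25. With an expert loss of 2.0 and a self loss of 1.0, the blended loss is 1.25. When no rollout succeeds, the demonstration carries the full weight.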

It beats vanilla SFT on every benchmark, by between 4 and 9 percent. It beats RLHF on most of them. It does this with no extra data and no larger model, just the rollouts and a different loss function.

Most tellingly: models trained with RASFT produce 70% fewer useless spinning steps. They stop mimicking. They start solving problems the way they want to.

Everything we built in 2025 was a local maximum

All four papers agree on exactly one thing. The entire generation of reasoning models released over the last 12 months is stuck at a terrible local maximum.

We optimized for correct final answers. We optimized for traces that look like human reasoning. We got exactly what we asked for. We got models that can nail every standard benchmark, that write beautiful convincing chain of thought, that fall apart completely when you rephrase the question, that loop endlessly, that cannot handle a single distracting sentence.

This is not incremental progress. This is a dead end. And almost nobody noticed until this week.

What comes next

We now know what to fix. None of this requires larger models. None of this requires more training data. None of this requires more compute. All of this requires that we stop measuring the wrong things.

First, retire final answer correctness as the primary evaluation metric. Start evaluating trace structure. Use the backtracking rate signal. Measure prefix gain.

Second, stop training reward models on step correctness. Train them on actual outcome. It is simpler and it works better.

Third, stop forcing models to copy human solutions. Let them reason their own way. Only guide them when they cannot proceed.

Fourth, throw out every existing probability reasoning benchmark. They are all useless.

Closing observation

There is one detail from the dice paper that no one has commented on yet.

When the authors ran the counterintuitive problems without Chain-of-Thought, when they just asked the model to output the answer directly, average accuracy went up.

Not down. Up. 7 percentage points.

Chain-of-Thought, the single most widely adopted technique for improving reasoning, makes these problems worse. Because when you ask the model to show its work, it does not start reasoning. It starts performing reasoning. It starts acting out the script it was trained to produce. And that script makes it stupid.

That is the core truth that all four of these papers are circling. We did not teach models to reason. We taught them to give a very convincing performance of reasoning.

And for a very long time, we could not tell the difference.


References

  1. How reliable are LLMs when it comes to playing dice? http://arxiv.org/abs/2606.07515v1
  2. A Comprehensive Anatomy of Human and DeepSeek-R1 LLM Mathematical Reasoning http://arxiv.org/abs/2606.07410v1
  3. From Correctness to Utility: Gain-Based Prefix Evaluation for LLM Reasoning http://arxiv.org/abs/2606.07190v1
  4. RASFT: Rollout-Adaptive Supervised Fine-Tuning for Reasoning http://arxiv.org/abs/2606.07006v1