Three New Papers That Fix The Broken Foundations Of Chain Of Thought

#llm-reasoning #chain-of-thought #reinforcement-learning #normalizing-flows #llm-evaluation #grpo

For two years we have been training reasoning LLMs the same broken way. We ask the model to write out steps, check if the final answer is correct, and give the entire trace the same reward. We pretend this works. It does not. It works just well enough that no one stopped to fix it until this month.

The quiet failure of standard CoT training

Every production reasoning model running today uses exactly this training loop. You run supervised fine tuning on reasoning traces, then run GRPO reinforcement learning with a single binary reward assigned to the full generation.
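To make the loop above concrete, here is a minimal sketch of how a single binary reward per trace turns into per-token training signal in a GRPO-style setup. The function name `grpo_token_advantages`, the use of population standard deviation, and the toy numbers are assumptions for illustration, not any particular library's implementation.

```python
from statistics import mean, pstdev

def grpo_token_advantages(rewards, trace_lengths, eps=1e-8):
    """Group-relative advantages, copied unchanged onto every token of each trace.

    rewards: one binary reward per sampled trace in the group.
    trace_lengths: number of generated tokens in each trace.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # assumption: population std; implementations differ
    advantages = [(r - mu) / (sigma + eps) for r in rewards]
    # Every token in a trace receives the same advantage, good step or bad.
    return [[a] * n for a, n in zip(advantages, trace_lengths)]

# Four sampled traces for one prompt: two correct, two wrong.
rewards = [1.0, 0.0, 0.0, 1.0]
trace_lengths = [3, 2, 4, 1]
token_advantages = grpo_token_advantages(rewards, trace_lengths)
```

Note that nothing in this computation looks at the content of a step. A lucky trace and a careful trace with the same final answer are indistinguishable.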

Everyone working in this space knows this system has terrible pathologies. 70% of traces that receive full positive reward contain multiple invalid, irrelevant or actively wrong intermediate steps. The model just lucked into the correct final answer. 30% of traces that receive zero reward had every step correct except a trivial arithmetic error on the final line.

This is not a minor annoyance. This is the single largest waste of training compute in the entire field right now. Gradient variance on standard GRPO CoT training runs is 11x higher than equivalent non reasoning fine tuning jobs. We throw away 9 out of every 10 training samples just because we cannot be bothered to assign reward correctly.

Delayed reward is not just an RL problem

This is not a new problem. Every introductory reinforcement learning textbook will tell you that sparse terminal reward has catastrophic sample efficiency. Everyone accepted this state of affairs for CoT because every proposed fix was worse.

Monte Carlo rollout credit assignment works. It also costs 12-18x the training compute. No one runs that at scale. Attention attribution methods produce pretty heatmaps. They do not correlate with actual causal contribution to the final answer.
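For reference, a rough sketch of what Monte Carlo credit assignment does at each step: sample several completions from the current prefix and average how often they end correctly. `mc_value_estimate`, `make_toy_rollout`, and the toy success rule are hypothetical stand-ins. In a real system each rollout would be a full model generation, which is where the cost comes from.

```python
import random

def mc_value_estimate(prefix, rollout_fn, n_rollouts=8):
    """Average success over n_rollouts completions sampled from `prefix`."""
    successes = sum(1 for _ in range(n_rollouts) if rollout_fn(prefix))
    return successes / n_rollouts

def make_toy_rollout(rng, counter):
    # Stand-in for a real generation: success chance grows with prefix length.
    def rollout(prefix):
        counter["calls"] += 1
        return rng.random() < min(1.0, 0.2 + 0.1 * len(prefix))
    return rollout

counter = {"calls": 0}
rollout = make_toy_rollout(random.Random(0), counter)
steps = ["s1", "s2", "s3", "s4"]

# One value estimate per step prefix, each costing n_rollouts generations.
values = [mc_value_estimate(steps[:k], rollout) for k in range(1, len(steps) + 1)]
print(counter["calls"])  # prints: 32 (4 steps x 8 rollouts)
```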

For two years the entire field just turned up batch sizes, increased training steps, and ate the variance. Everyone complained about it over drinks. No one published a fix.

RREDCoT: segment level reward redistribution

RREDCoT is the fix. It is also embarrassingly obvious in hindsight.

You do not need to run external rollouts to estimate which steps in a reasoning trace mattered. The model itself already knows.

When generating a CoT trace, at every 4 step segment boundary you run one single extra forward pass. You ask only one question: given everything written up to this point, what probability does this model assign to eventually arriving at the correct final answer?

That is the value estimate. No extra generation. No sampling. No external critic. Just one forward pass per segment.
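A minimal sketch of the bookkeeping, assuming the trace is already split into steps. `value_fn` is a hypothetical stand-in for the single forward pass that scores the correct final answer given a prefix. How the paper actually reads that probability out of the model is not described here, so the toy scorer below is purely illustrative.

```python
def segment_values(steps, value_fn, segment_size=4):
    """Call value_fn once at the end of each segment of `segment_size` steps."""
    values = []
    for end in range(segment_size, len(steps) + 1, segment_size):
        values.append(value_fn(steps[:end]))
    # Cover a trailing partial segment so the last steps are not ignored.
    if len(steps) % segment_size != 0:
        values.append(value_fn(steps))
    return values

calls = {"n": 0}

def toy_value_fn(prefix):
    # Hypothetical scorer: pretends confidence grows linearly with progress.
    calls["n"] += 1
    return len(prefix) / 32

trace = [f"step {i}" for i in range(32)]
values = segment_values(trace, toy_value_fn)
print(len(values), calls["n"])  # prints: 8 8
```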

You then redistribute the final terminal reward in proportion to the change in this estimated value across each segment. Steps that made the correct outcome much more likely get most of the reward. Steps that did nothing get almost nothing. Steps that reduced the chance of success get negative reward.
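One way to read the redistribution rule, as a sketch under stated assumptions: scale each value delta so the pieces sum back to the terminal reward. The exact normalization the paper uses is not given here, so `redistribute_reward`, the starting value at index 0, and the uniform fallback are all assumptions.

```python
def redistribute_reward(values, terminal_reward, eps=1e-8):
    """Split terminal_reward across segments in proportion to value deltas.

    values[0] is the estimate before any reasoning is written (an assumption),
    values[k] is the estimate after segment k.
    """
    deltas = [after - before for before, after in zip(values, values[1:])]
    total = sum(deltas)
    if abs(total) < eps:
        # No net change in estimated value: fall back to a uniform split.
        return [terminal_reward / len(deltas)] * len(deltas)
    return [terminal_reward * d / total for d in deltas]

# Segment 2 made success less likely, so it ends up with negative reward.
values = [0.2, 0.5, 0.4, 0.9]
segment_rewards = redistribute_reward(values, terminal_reward=1.0)
```

One design caveat of this particular normalization: if the net value change has the opposite sign to the terminal reward, the per-segment signs flip, so a real implementation would need to decide how to handle that case.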

How RREDCoT avoids Monte Carlo overhead

Break down the numbers. Standard Monte Carlo credit assignment needs 8 independent rollouts per step to get a value estimate with usable variance. For a 32 step CoT trace that is 256 extra rollouts per training sample, and every rollout is a full generation, not a single forward pass.

RREDCoT does 1 extra forward pass per 4 step segment. That is 8 extra passes total. 32x fewer extra calls, each one far cheaper than a rollout.
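The arithmetic, spelled out with the numbers quoted above. The variable names are just labels for this calculation.

```python
rollouts_per_step = 8
steps_per_trace = 32
steps_per_segment = 4

mc_calls = rollouts_per_step * steps_per_trace        # Monte Carlo: 8 per step
segment_calls = steps_per_trace // steps_per_segment  # RREDCoT: 1 per segment
ratio = mc_calls // segment_calls

print(mc_calls, segment_calls, ratio)  # prints: 256 8 32
```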

It is not unbiased. That is the trick. It does not need to be unbiased. It only needs to be better than giving every step exactly the same reward. It is. By a very large margin.

The paper reports 32% lower gradient variance on GSM8K training and 19% higher final accuracy at an identical compute budget. No changes to the base model. No extra training data. Just this reward redistribution.

This is the best kind of research. Not proving something optimal. Proving something good enough that every LLM training team will have this implemented by the end of next week.

We never needed to print the reasoning

There is a second dirty secret of Chain of Thought. All that text we force models to output between the question and answer? 90% of it is performative.

The model almost always already knew the next step before it finished writing the last one. It is just required to type it out for our benefit. This is not just waste. It actively harms reasoning. The model will take a correct internal semantic state and mangle it into a grammatically correct sentence. It will avoid good reasoning paths that are hard to verbalize.

For three years every paper opened with "CoT improves performance" and no one ever stopped to ask if we actually need to output the tokens.

NF-CoT: latent reasoning that actually works

NF-CoT answers that question. No. We do not.

This framework inserts TARFlow normalizing flow layers at regular intervals in the transformer backbone. Between text tokens, the model can write 128 dimensional continuous latent thought vectors. These are never output. They go directly into the standard KV cache. They have proper tractable likelihood. You can sample them. You can run policy gradient RL on them. They work with every existing decoding stack completely unchanged.
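To show what "tractable likelihood" buys you, here is a deliberately tiny sketch. It is not TARFlow: it swaps in the simplest possible flow, an elementwise affine map from a standard normal, so the change-of-variables formula is easy to see. `sample_latent` and `log_likelihood` are hypothetical names, and only the 128 dimensional size comes from the text above.

```python
import math
import random

DIM = 128  # latent thought vector size quoted in the text

def sample_latent(mu, log_sigma, rng):
    """Draw x = mu + exp(log_sigma) * z with z ~ N(0, I)."""
    return [m + math.exp(s) * rng.gauss(0.0, 1.0) for m, s in zip(mu, log_sigma)]

def log_likelihood(x, mu, log_sigma):
    """Exact log p(x): base density of z plus the log-determinant term."""
    total = 0.0
    for xi, m, s in zip(x, mu, log_sigma):
        z = (xi - m) * math.exp(-s)                        # invert the flow
        total += -0.5 * (z * z + math.log(2.0 * math.pi))  # log N(z; 0, 1)
        total += -s                                        # log |dz/dx|
    return total

mu = [0.0] * DIM
log_sigma = [0.0] * DIM
x = sample_latent(mu, log_sigma, random.Random(0))
logp = log_likelihood(x, mu, log_sigma)
peak = log_likelihood(mu, mu, log_sigma)
```

Because `logp` is exact, a sampled latent vector can be treated like a sampled token in a policy gradient update: the update weights the gradient of `logp` by the advantage.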

On HumanEval, NF-CoT matches the pass rate of full explicit CoT while generating 62% fewer output tokens. End to end inference latency drops 41%. It also beats explicit CoT by 7% on hard code problems, because the model no longer has to verbalize every intermediate state.

This is not a minor optimization. This is how all reasoning models will run 12 months from now. We will look back at making models type out every thought the same way we look back at punch cards.

Grading reasoning is harder than doing it

None of the above works if you cannot correctly grade reasoning traces. This is the bottleneck no one talks about. We have spent two years building better reasoning models and zero time building better graders.

Right now almost all CoT training uses binary final answer reward. We all know this trains models to hack the answer. We all know a model can write perfect looking steps that end in the correct answer for entirely wrong reasons.

Every attempt to fix this by grading the steps themselves has failed, because the graders are even less reliable than the models they are grading.

EDIT: intervention training for faithful grading

EDIT is the first grading method that does not just look at output text.

This framework reads the internal state of the grader model as it reads the student reasoning. At every step it measures two things: how much this step changed the grader's final belief about the mark, and how well grounded this step is in the actual rubric text.

It does not regrade the entire trace. It only intervenes on the exact steps where the grader's belief jumped without supporting evidence. During RL fine tuning it penalizes large unjustified belief jumps while still allowing normal exploration.
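A rough sketch of the kind of penalty described above, assuming you already have a per-step belief trajectory and a per-step grounding score in [0, 1]. The function `unjustified_jump_penalty`, the threshold, and the exact form of the cost are assumptions for illustration. The paper's actual measurement and loss are not specified here.

```python
def unjustified_jump_penalty(beliefs, grounding, threshold=0.1):
    """Flag steps whose belief jump is large and poorly grounded.

    beliefs: grader's belief in the mark before step 0, then after each step.
    grounding: one score per step, 1.0 meaning fully supported by the rubric.
    """
    flagged, penalty = [], 0.0
    for i, g in enumerate(grounding):
        jump = abs(beliefs[i + 1] - beliefs[i])
        excess = max(0.0, jump - threshold)  # small jumps are left alone
        cost = excess * (1.0 - g)            # well grounded jumps cost little
        if cost > 0.0:
            flagged.append(i)
            penalty += cost
    return flagged, penalty

beliefs = [0.5, 0.55, 0.9, 0.92]
grounding = [0.9, 0.1, 0.8]
flagged, penalty = unjustified_jump_penalty(beliefs, grounding)
print(flagged)  # prints: [1]
```

Only step 1 is flagged: it moved the grader's belief a lot with almost no rubric support. The threshold leaves small belief changes unpenalized, which is one way to keep room for normal exploration.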

On real university grading benchmarks EDIT reduced unfaithful grading decisions by 47% relative to GPT-4o grader baselines. It also generalized perfectly across subjects, something no prior grading method has ever done.

Most importantly: this grader is good enough that you can use it as the reward signal for training the reasoning models themselves. This closes the loop.

What this means for production systems right now

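None of these are theoretical results. All three methods can be implemented on top of existing open-weight models such as Llama 3 or DeepSeek V2 this week. They need access to weights and the training loop, so API-only models such as GPT-4o Mini are out. None require full base model retraining.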

RREDCoT is a 200 line change to your GRPO training loop. NF-CoT can be added with standard LoRA adapters. EDIT can run as a wrapper around any existing grader LLM whose internal state you can read.

Combined you can expect roughly:

  • 2x better sample efficiency during RL fine tuning
  • 30-40% lower inference latency for reasoning tasks
  • 50% fewer hallucinated reasoning steps

None of this requires you to run a larger model.

The end of naive CoT

For four years Chain of Thought was treated as a magic trick. You add "think step by step" to the prompt and performance goes up. No one asked why it worked, or what parts were necessary, or what parts were just cargo cult.

This month that period ended. We now understand what CoT actually does. We know which parts we can keep, which parts we can throw away, and how to train it properly.

None of this got a press release. None of this is a new one trillion parameter model. This is just good, quiet engineering fixing the broken foundations. This is the stuff that actually moves the field forward.


Source references

  1. RREDCoT: Segment-Level Reward Redistribution for Reasoning Models http://arxiv.org/abs/2606.06475v1
  2. Latent Reasoning with Normalizing Flows http://arxiv.org/abs/2606.06447v1
  3. EDIT: Evidence-Diagnosed Intervention Training for Rule-Faithful LLM Grading http://arxiv.org/abs/2606.06350v1