Last week five connected papers landed on arXiv that will change every production LLM training pipeline over the next six months.
This is not incremental improvement. For two full years, post-training research followed one predictable track: someone would propose a minor variant on DPO, run it on GSM8K, report a 2% gain, and everyone would spend three months arguing about whether it actually worked or just overfitted the benchmark.
That era ended. None of these five papers propose a new policy objective. None of them introduce a fancy new loss term. Every single one of them attacks an unspoken assumption that everyone had accepted as ground truth.
The state distribution thesis breaks everything
The most important paper of the batch is "Post-Training is About States, Not Tokens". It is also the simplest.
The authors ran one controlled experiment that invalidates almost every folk belief about fine-tuning. They took Qwen3-0.6B and put it through a heavy SFT run that pushed GSM8K up 12% but erased 21% of MMLU performance. Everyone has seen this happen. Everyone blames the cross-entropy loss, or catastrophic forgetting, or distribution shift.
Then they took that broken, degraded SFT model and used it as the teacher for on-policy distillation. No extra data. No better objective. Exact same supervision signal. The only difference: the student model trained on states that it generated itself, rather than on the fixed prompt-completion pairs from the original SFT dataset.
The student beat the teacher on GSM8K. It beat the teacher on TruthfulQA. It recovered all of the lost MMLU performance.
This should not be possible under any existing framework for analysing post-training. The only variable changed was which prefix states the model saw during training. The form of supervision did not change. The information content of the supervision did not change.
The conclusion is brutal. Almost all of the tradeoffs we accept between capability and retention are not inherent to fine-tuning. They are artifacts of training on fixed offline datasets. RL and on-policy distillation do not work better because they have better loss functions. They work better because they train the model on states that the model actually produces.
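To make the "states, not tokens" distinction concrete, here is a minimal toy sketch in Python. It is not the paper's code. Every name in it (`prefixes`, `offline_states`, `on_policy_states`, `distill_loss`, and the two-token toy teacher and student) is a hypothetical stand-in. The point it illustrates is that the loss is the same function in both settings; only the list of prefix states passed to it changes.

```python
import math
import random

def prefixes(prompt, completion):
    """Every prefix state (prompt plus first k completion tokens) along one sequence."""
    return [tuple(prompt) + tuple(completion[:k]) for k in range(len(completion))]

def offline_states(dataset):
    """States taken from fixed prompt-completion pairs, as in SFT."""
    return [s for prompt, completion in dataset for s in prefixes(prompt, completion)]

def on_policy_states(prompts, student_sample, rng):
    """States taken from completions the student generates itself."""
    return [s for p in prompts for s in prefixes(p, student_sample(p, rng))]

def distill_loss(states, teacher_probs, student_probs):
    """Mean cross-entropy of the student against the teacher over the given states."""
    total = 0.0
    for s in states:
        t, q = teacher_probs(s), student_probs(s)
        total -= sum(p * math.log(q[tok]) for tok, p in t.items() if p > 0)
    return total / len(states)

# Toy two-token models, purely illustrative assumptions.
def teacher_probs(state):
    return {"a": 0.9, "b": 0.1}

def student_probs(state):
    # The student behaves differently right after it has emitted "b".
    return {"a": 0.5, "b": 0.5} if state[-1] == "b" else {"a": 0.8, "b": 0.2}

def student_sample(prompt, rng, length=3):
    out = []
    for _ in range(length):
        probs = student_probs(tuple(prompt) + tuple(out))
        out.append(rng.choices(list(probs), weights=list(probs.values()))[0])
    return out

rng = random.Random(0)
dataset = [(["q"], ["a", "a", "a"])]
print(distill_loss(offline_states(dataset), teacher_probs, student_probs))
print(distill_loss(on_policy_states(["q"], student_sample, rng), teacher_probs, student_probs))
```

In this toy, the offline dataset never contains a prefix ending in "b", so the offline loss never touches the states where the student is weakest. On-policy rollouts visit those states whenever the student actually produces them.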
The GRPO clipping bottleneck was hiding in plain sight
Every team running GRPO at scale has hit this wall. You get to ~71% on GSM8K. Training starts oscillating. You lower the clipping threshold. Loss gets smoother. Performance stops moving entirely. No amount of tuning learning rate, KL coefficient, or batch size will get you past it.
"Clipping Bottleneck" explains exactly why this happens.
Hard epsilon clipping, the core mechanism carried over unchanged from 2017 PPO, discards every token that falls just outside the allowed policy ratio range. The authors found that 28-35% of all useful gradient signal lies exactly in that 0.05 band just outside the clipping threshold. These are not bad outliers. These are exactly the tokens that are improving the policy.
The fix is almost embarrassingly simple. Instead of discarding all out-of-bounds tokens, you keep them 20% of the time when they are within 1.5x the clipping threshold. That is the entire Near-boundary Stochastic Rescue modification. Three lines of code change. No extra hyperparameters.
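A rough Python sketch of that rule follows. It is one reading of the description above, not the authors' implementation. It assumes "within 1.5x the clipping threshold" means `abs(ratio - 1) <= 1.5 * eps`, that the rescue is an independent coin flip per token, and that rescued tokens are not reweighted. The function names `hard_clip_keeps` and `nsr_keeps` are hypothetical.

```python
import random

def hard_clip_keeps(ratio, advantage, eps):
    """True if the standard clipped surrogate passes gradient for this token."""
    if advantage > 0:
        return ratio <= 1.0 + eps   # clipped once the ratio rises above 1 + eps
    if advantage < 0:
        return ratio >= 1.0 - eps   # clipped once the ratio falls below 1 - eps
    return False                    # zero advantage contributes no gradient

def nsr_keeps(ratio, advantage, eps, rng, rescue_prob=0.2, band=1.5):
    """Hard clipping plus a stochastic rescue of near-boundary tokens (assumed form)."""
    if hard_clip_keeps(ratio, advantage, eps):
        return True
    if advantage == 0:
        return False
    near_boundary = abs(ratio - 1.0) <= band * eps
    return near_boundary and rng.random() < rescue_prob

rng = random.Random(0)
eps = 0.2
print(hard_clip_keeps(1.25, 1.0, eps))  # False: hard clipping drops this token
kept = sum(nsr_keeps(1.25, 1.0, eps, rng) for _ in range(10_000))
print(kept / 10_000)                    # close to rescue_prob
```

In a real trainer the boolean would become a per-token mask multiplied into the surrogate loss. Tokens beyond the 1.5x band are still dropped every time.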
Across 7B, 30B dense and 141B MoE models, this change delivers a consistent 4-7% gain on all reasoning benchmarks. It eliminates training oscillation. It works on top of GRPO, DAPO and GSPO without modification.
This paper was posted four days ago. As of this morning every major LLM lab has this running in production.
RLIF is no longer a research toy
Reinforcement Learning from Internal Feedback was the most promising idea of 2025. It was also completely unusable. You could run three iterations, get nice gains, then entropy would collapse, the model would start outputting garbage, and training would die. No one had a reliable fix.
"Two is better than one" solves this.
The mistake everyone made was trying to build one good internal reward. The authors show you do not need one good reward. You need two bad ones that fail in completely different ways.
They use one reward for the final answer: run 8 samples, cluster them, reward answers that agree with the majority. They use a completely separate reward for every intermediate token: penalize any token where the model assigned >99% probability. That is it. No external labels. No ground truth. No human feedback.
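The two signals can be sketched in a few lines of Python. This is an illustration under assumptions, not the paper's code. "Clustering" is reduced here to exact match after trimming and lowercasing, the agreement reward is 0 or 1, and the per-token penalty is a flat -1. The names `majority_answer_rewards` and `overconfidence_penalties` are hypothetical.

```python
from collections import Counter

def majority_answer_rewards(answers):
    """Reward 1.0 for each sampled answer that matches the most common answer."""
    normalized = [a.strip().lower() for a in answers]
    majority, _ = Counter(normalized).most_common(1)[0]
    return [1.0 if a == majority else 0.0 for a in normalized]

def overconfidence_penalties(token_probs, threshold=0.99, penalty=1.0):
    """Penalize every token whose sampled probability exceeds the threshold."""
    return [-penalty if p > threshold else 0.0 for p in token_probs]

samples = ["42", "42 ", "7", "42", "13", "42", "7", "42"]  # 8 samples, one prompt
print(majority_answer_rewards(samples))
print(overconfidence_penalties([0.50, 0.995, 0.99]))
```

Neither function sees a label. The first can be fooled when the model is confidently and consistently wrong; the second ignores the answer entirely. That mismatch in failure modes is the property the paragraph above is describing.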
Add the proposed KL-Cov regularizer, which acts only on the 1% of tokens that are driving entropy collapse, and you can run 22 consecutive RL iterations with no collapse. You get 92% of the performance of fully supervised RLVR, at 1/10th the cost.
This is not an incremental improvement. This removes the single largest bottleneck for scaling post-training. You no longer need labelled datasets to run RL. You only need prompts.
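The text above does not say how those 1% of tokens are picked, so the Python sketch below fills that in with labeled assumptions. It ranks tokens by the centered product of log-probability and advantage (a per-token covariance term), takes the top fraction, and applies a KL-style penalty only there, using half the squared log-ratio as the per-token estimate. The function name `kl_cov_penalty` and both choices are illustrative guesses, not the paper's definition.

```python
def kl_cov_penalty(logp_new, logp_old, advantages, frac=0.01, coef=1.0):
    """KL-style penalty restricted to the highest-covariance tokens (assumed form)."""
    n = len(logp_new)
    mean_lp = sum(logp_new) / n
    mean_adv = sum(advantages) / n
    # Per-token contribution to the covariance between log-prob and advantage.
    cov = [(lp - mean_lp) * (a - mean_adv) for lp, a in zip(logp_new, advantages)]
    k = max(1, int(n * frac))
    selected = sorted(range(n), key=lambda i: cov[i], reverse=True)[:k]
    # Assumed per-token KL estimate: half the squared log-ratio.
    penalty = coef * sum(0.5 * (logp_new[i] - logp_old[i]) ** 2 for i in selected) / n
    return penalty, selected

logp_old = [-1.0] * 100
logp_new = [-1.0] * 100
advantages = [0.0] * 100
logp_new[3], advantages[3] = -0.01, 2.0   # one confident, high-advantage token
print(kl_cov_penalty(logp_new, logp_old, advantages))
```

With 100 tokens and `frac=0.01`, exactly one token is regularized and the other 99 are left free to move.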
Heterogeneous collaborative RL just works
For all of GRPO's advantages, it had one crippling constraint: all samples used for training had to come from the current policy. If you used samples from an older checkpoint, or a different model, training would diverge. This meant you could not share generation work across teams. Everyone running RL had to spin up their own full inference cluster.
F-TIS breaks this constraint.
The authors show that with correctly truncated and filtered importance sampling, you can safely mix samples from completely different models during GRPO training. Convergence is identical to pure on-policy training. In many tests, generalization on out-of-distribution tasks is actually 8-12% better.
This is the first practical protocol for collaborative distributed post-training. Multiple teams running different model sizes, different architectures, even different base models can pool their generated samples. Everyone gets a better model. No one has to share weights.
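As a rough illustration of what "truncated and filtered" can mean, here is a generic Python sketch of token-level importance weights. It is not the F-TIS algorithm itself. The cap of 2.0, the filter floor of 0.1, the token-level granularity, and the name `ftis_weights` are all assumptions chosen for the example.

```python
import math

def ftis_weights(logp_current, logp_behavior, cap=2.0, min_ratio=0.1):
    """Importance weights for samples drawn from a different (behavior) model.

    Truncation: ratios above `cap` are clamped to `cap`, which bounds variance.
    Filtering: ratios below `min_ratio` get weight 0, dropping samples the
    current policy would almost never produce.
    """
    weights = []
    for lc, lb in zip(logp_current, logp_behavior):
        ratio = math.exp(lc - lb)   # pi_current(token) / pi_behavior(token)
        weights.append(0.0 if ratio < min_ratio else min(ratio, cap))
    return weights

cur = [math.log(0.5), math.log(0.5), math.log(0.01)]
beh = [math.log(0.5), math.log(0.1), math.log(0.5)]
print(ftis_weights(cur, beh))   # raw ratios are roughly 1, 5 and 0.02
```

Each weight would multiply that token's term in the GRPO loss. Computing it requires the generating model's log-probabilities to travel with the samples, but not its weights.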
What you should change this week
Stop tuning your loss function. Stop arguing about forward KL vs reverse KL. Stop running 8 epochs of SFT.
Right now, this week, you can go make these changes:
- Replace hard clipping in your GRPO implementation with NSR. This will take one hour. You will get an immediate 3-6% gain.
- Cut your SFT run to one single epoch. Take that checkpoint and run three light RL iterations. You will get better task performance and far less forgetting.
- If you run distillation, run it on-policy. Stop distilling from an offline dataset.
Most importantly: stop measuring only loss and benchmark scores. Start measuring what state distribution your model is actually training on. That variable has been having larger effects than every other hyperparameter combined, and almost no one was tracking it.
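One cheap way to start tracking this, sketched in Python: log how likely the current policy finds the tokens it is being trained on, and compare that number between your offline data and fresh rollouts. This is a hypothetical proxy, not a metric taken from any of the papers. The names `mean_token_logprob` and `off_policy_gap` are made up for the example. The inputs are assumed to be per-token log-probabilities that the current model assigns to each training sequence.

```python
def mean_token_logprob(batch_token_logprobs):
    """Average log-prob the current policy assigns to the tokens in a training batch."""
    total = sum(sum(seq) for seq in batch_token_logprobs)
    count = sum(len(seq) for seq in batch_token_logprobs)
    return total / count

def off_policy_gap(offline_batch, rollout_batch):
    """How much less likely the offline data is than the model's own rollouts."""
    return mean_token_logprob(rollout_batch) - mean_token_logprob(offline_batch)

offline = [[-2.0, -3.0], [-1.0]]   # log-probs of fixed dataset tokens
rollouts = [[-0.5, -0.5]]          # log-probs of self-generated tokens
print(off_policy_gap(offline, rollouts))
```

A gap near zero suggests the training states resemble what the model produces on its own. A large gap suggests the run is spending its updates on states the model rarely visits.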
This batch of papers closes one chapter of LLM research and opens another. We now understand the basic mechanics of post-training well enough that we are no longer fumbling in the dark testing random loss functions. The next fights will be about measurement, distribution, and scaling.