Last week four separate research groups dropped RLVR papers on arXiv within 48 hours of each other.
None introduced a new model. None used more training data. None scaled up batch size or step count.
Every one of them arrived at the same conclusion from a different angle: we have been running RLVR fine-tuning wrong. At least 75% of the compute we throw at this process today produces no performance gain. Most of it actively harms final model quality.
This is not an incremental improvement. This is a full reset of how we will fine-tune reasoning models going forward.
RLVR became the standard overnight
Six months ago GRPO landed and ate the entire LLM reasoning field. Within 90 days every major lab had abandoned supervised fine-tuning, DPO, and every other approach for mathematical and coding models. RLVR with verifiable binary rewards was the only thing that moved the needle on hard benchmarks.
Everyone copied the recipe: generate 8 rollouts per prompt, compute group-relative advantages, update the policy, and repeat for 1000-5000 steps. No one asked whether this was efficient. No one checked what the parameters were actually doing during training. We just ran the loop and watched MATH scores go up.
All four papers this week poked at that black box. None of them liked what they found.
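For reference, the group-relative advantage step in that recipe can be sketched in a few lines. This is a minimal illustration, assuming binary rewards and the commonly used mean and standard deviation normalization within each group; the function name and the `eps` constant are illustrative choices, not taken from any of the four papers.

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """Score each rollout against the other rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    # eps (an assumed constant) avoids division by zero when all rewards match
    return [(r - mu) / (sigma + eps) for r in rewards]

# One prompt, 8 rollouts, binary verifier reward (1 = correct, 0 = wrong)
rewards = [1, 0, 0, 1, 0, 0, 0, 1]
advantages = group_relative_advantages(rewards)
```

Note that a group where every rollout gets the same reward yields all-zero advantages, so that prompt contributes nothing to the update.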
The entire RLVR trajectory is rank-1
RELEX is the most shocking result of the four. The authors recorded full parameter checkpoints every step during standard RLVR runs across Qwen2.5-Math-1.5B, Qwen3-4B and Qwen3-8B. They then ran PCA on the sequence of weight deltas.
92% of the total variance in parameter updates was captured by the first principal component. 96% was captured by the first two.
For all practical purposes, the entire model moves along a single straight line in parameter space during RLVR training. Every step just moves it a little further along that same line. All other movement is uncorrelated optimization noise.
Once you know this, you do not need to run 1000 training steps. You run 50 steps, fit a linear regression on the observed movement along that rank-1 direction, and extrapolate out as far as you want.
RELEX extrapolated checkpoints 20x past the end of the observed training window. Those extrapolated checkpoints matched or outperformed the actually trained checkpoints on every benchmark. They did this with 15% of the original compute.
Increasing the subspace rank did not help. Nonlinear extrapolation made results worse. All the extra training steps were doing was walking forward along the line, while adding noise that had to be filtered out later.
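The idea can be sketched on synthetic data. This is a toy illustration, not the RELEX implementation: it assumes checkpoints are flattened into rows of a matrix, uses a centered SVD as the PCA step, and fits the position along the first component with an ordinary least-squares line. The function names, the 20-dimensional "model", and the noise scale are all made up for the example.

```python
import numpy as np

def fit_rank1_trajectory(checkpoints, steps):
    """Fit a straight line through a (num_steps, num_params) checkpoint matrix."""
    center = checkpoints.mean(axis=0)
    centered = checkpoints - center
    # PCA via SVD: rows of vt are the principal directions
    _, singular_values, vt = np.linalg.svd(centered, full_matrices=False)
    direction = vt[0]
    explained = singular_values[0] ** 2 / np.sum(singular_values ** 2)
    # Position of each checkpoint along the first principal direction
    coords = centered @ direction
    slope, intercept = np.polyfit(steps, coords, 1)
    return center, direction, slope, intercept, explained

def extrapolate(fit, target_step):
    """Predict the flattened weights at a step outside the observed window."""
    center, direction, slope, intercept, _ = fit
    return center + (slope * target_step + intercept) * direction

# Synthetic run: weights drift along one direction plus small noise
rng = np.random.default_rng(0)
true_direction = rng.normal(size=20)
theta0 = rng.normal(size=20)
steps = np.arange(50)
checkpoints = (theta0
               + steps[:, None] * 0.01 * true_direction
               + rng.normal(scale=1e-3, size=(50, 20)))

fit = fit_rank1_trajectory(checkpoints, steps)
predicted = extrapolate(fit, target_step=1000)  # 20x past the observed window
expected = theta0 + 1000 * 0.01 * true_direction
rel_err = np.linalg.norm(predicted - expected) / np.linalg.norm(expected - theta0)
```

On real models the checkpoint matrix is far too large to decompose naively, so a practical version would need a memory-efficient way to find the direction; that part is outside this sketch.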
Standard credit assignment was optimizing for newlines
DelTA started from a separate observation. No one could explain why sequence level rewards produced such consistent improvements. No one had looked at which tokens actually received gradient updates.
When you average gradient vectors across high and low reward sequences, high frequency shared tokens completely dominate the resulting update direction. Formatting tokens. Newlines. Equals signs. The word "therefore".
These tokens appear in every response. They have very large gradients. They end up accounting for over 80% of the total update magnitude. The actual reasoning tokens that differ between correct and incorrect answers contribute almost nothing to the standard RLVR update.
DelTA does one simple thing. Before calculating the update direction, it reweights every token gradient by how well that token actually discriminates between high and low reward sequences. Common uninformative tokens get downweighted almost to zero. Rare diagnostic tokens get amplified.
This change alone produces an average 2.9-point gain across all seven tested math benchmarks. It requires no additional samples, no additional training steps, and no changes to the rest of the training loop. It is a drop-in replacement that you can implement in an afternoon.
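To make the reweighting idea concrete, here is a toy sketch. The scoring rule is an assumption chosen for illustration, not DelTA's actual formula: each token is weighted by the absolute difference between the share of high-reward sequences that contain it and the share of low-reward sequences that contain it. The function name and the example sequences are hypothetical.

```python
from collections import Counter

def discrimination_weights(high_seqs, low_seqs):
    """Assumed scoring rule: |share of high-reward seqs with token
    - share of low-reward seqs with token|. Both lists must be non-empty."""
    high_df = Counter(tok for seq in high_seqs for tok in set(seq))
    low_df = Counter(tok for seq in low_seqs for tok in set(seq))
    vocab = set(high_df) | set(low_df)
    return {
        tok: abs(high_df[tok] / len(high_seqs) - low_df[tok] / len(low_seqs))
        for tok in vocab
    }

# Toy tokenized rollouts for a prompt whose correct answer is 4
high_reward = [["x", "=", "4", "\n", "therefore"],
               ["so", "x", "=", "4", "\n", "therefore"]]
low_reward = [["x", "=", "5", "\n", "therefore"],
              ["x", "=", "7", "\n", "therefore"]]

weights = discrimination_weights(high_reward, low_reward)
```

Tokens that appear in every sequence regardless of reward (the newline, the equals sign, "therefore") get weight zero under this rule, while the answer tokens that separate correct from incorrect rollouts get the largest weights. Multiplying each token's gradient by its weight before averaging is the step the paragraph above describes.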
You only need 150 online steps
G2D attacked the largest running cost of RLVR: online rollouts.
Conventional wisdom said online GRPO would always beat offline DPO. This was correct, but only when people generated the preference dataset from the cold base SFT model.
The G2D authors ran a simple experiment. They ran GRPO for K steps, dumped all generated rollouts to a static dataset, then ran DPO once on that dataset. They then compared the final result against running GRPO for the full 1000 steps.
At K=150, DPO on the static dataset beat full GRPO by 10.8 percentage points on MATH-500, using a quarter of the total compute.
Performance peaks at 100-200 online steps. Running more steps makes the final model worse. After 200 steps the policy becomes overconfident. It stops generating plausible wrong answers. Without those wrong answers there is no contrastive signal left in the dataset. You can run another 800 steps and get nothing for it.
The online-offline gap was never about the training algorithm. It was about the quality of the data you feed it.
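The collapse of the contrastive signal is easy to see when turning dumped rollouts into preference pairs. The sketch below is an illustration under assumptions, not the G2D code: rollouts are assumed to be `(prompt, response, reward)` tuples with binary rewards, and the `prompt`/`chosen`/`rejected` keys follow a layout commonly used for DPO datasets.

```python
from itertools import product

def build_preference_pairs(rollouts):
    """Pair every correct response with every wrong response for the same prompt."""
    by_prompt = {}
    for prompt, response, reward in rollouts:
        by_prompt.setdefault(prompt, {0: [], 1: []})[reward].append(response)
    pairs = []
    for prompt, groups in by_prompt.items():
        for chosen, rejected in product(groups[1], groups[0]):
            pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

# Early in training: the policy still produces plausible wrong answers
early = [("2+2?", "4", 1), ("2+2?", "four", 1), ("2+2?", "5", 0), ("2+2?", "22", 0)]
# Late in training: every rollout is correct, so nothing can be contrasted
late = [("2+2?", "4", 1), ("2+2?", "4", 1), ("2+2?", "four", 1), ("2+2?", "4", 1)]

early_pairs = build_preference_pairs(early)
late_pairs = build_preference_pairs(late)
```

The early group yields four usable pairs and the late group yields none, which is the data-quality effect described above: extra online steps stop adding contrastive examples once the wrong answers disappear.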
Pairwise advantages beat scalar ranks
LamPO fixed the last broken part of the standard GRPO implementation.
Standard GRPO takes 8 rollouts, scores them, and assigns each a single scalar advantage relative to the group average. This throws away almost all the information present in the group of responses.
There are 28 pairwise comparisons you can make between 8 responses. LamPO uses all of them. Instead of one scalar per sequence, it calculates a decomposed advantage for every pair, weighted by the relative confidence of the model in each response.
This change produces more stable training, lower gradient variance, and consistent gains of 1 to 2 points across every tested model and benchmark. It adds zero overhead and uses exactly the same samples as standard GRPO.
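The pair count and the general shape of a pairwise advantage can be sketched as follows. This is not LamPO's formula: the confidence weighting is not specified in the text above, so it is left as a hypothetical pluggable `weight(i, j)` function that defaults to uniform weights.

```python
from itertools import combinations
from math import comb

def pairwise_advantages(rewards, weight=lambda i, j: 1.0):
    """Accumulate a per-rollout advantage from all pairwise reward differences.

    `weight(i, j)` is a hypothetical hook for a symmetric per-pair weight
    (for example, one based on model confidence); it defaults to 1.0.
    """
    n = len(rewards)
    adv = [0.0] * n
    for i, j in combinations(range(n), 2):
        w = weight(i, j)
        diff = rewards[i] - rewards[j]
        adv[i] += w * diff  # i gains when it beats j
        adv[j] -= w * diff  # j loses by the same amount
    return [a / (n - 1) for a in adv]

rewards = [1, 0, 0, 1, 0, 0, 0, 1]
num_pairs = comb(len(rewards), 2)  # 8 choose 2
uniform = pairwise_advantages(rewards)
```

With uniform weights this reduces to a rescaled mean-centered reward, `n / (n - 1) * (r_i - mean)`, so the pairs carry no extra information on their own. Whatever gain a pairwise method delivers has to come from how the individual pairs are weighted.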
What this means for your production pipeline tomorrow
None of these results are theoretical. All have working reference implementations. All reproduce across every common open-source base model.
If you are running RLVR fine-tuning today, you can make these changes this week:
- Do not run more than 200 online GRPO steps. There is no value past this point.
- At 50 steps extract the weight delta vector. Extrapolate the rest of the run with RELEX. You will have your final model before the original run would have finished 10% of its steps.
- Replace the default advantage calculation with DelTA token weighting. This is a free gain of roughly 3 points.
- Or even simpler: stop at 150 steps, export all rollouts, run one DPO pass. You will beat running GRPO for 1000 steps at a quarter of the cost.
- Swap GRPO for LamPO when you next update your training stack.
The quiet implication
All four papers are pointing at the same uncomfortable truth.
RLVR does not teach the model new reasoning. It does not discover new capabilities. It does not explore parameter space.
RLVR is just aligning the model output along a single direction that already existed fully formed inside the base model. The entire expensive distributed training loop we built is just a very slow, very noisy, very inefficient way to do linear regression along that direction.
We spent a year running thousands of A100 hours to do something we could have done with 50 observations and a line fit. That is embarrassing. It is also the best news this field has had in 12 months.