Post-training is where modern LLM reasoning is actually won

Stop arguing about 1T-parameter base models. If you run LLMs in production today, good post-training will buy you larger reasoning gains than upgrading to the next-generation base model. That is the consistent signal in this month's strongest papers.

Pretraining has hit hard diminishing returns. The delta between a raw base model and the same model after competent post-training is now larger than the delta between a base model released this quarter and one released 12 months ago. Almost no one talks about this. Benchmarks are run on fine-tuned models, everyone compares final outputs, and pretraining gets all the credit.

We stopped improving base models and started fixing post-training

For three years the entire field iterated on transformer architecture. Normalisation layers, attention patterns, positional embeddings, MoE gating. Every change delivered small incremental gains.

That era ended around mid-2025. Since then, almost all measurable improvements to LLM reasoning and efficiency have come after pretraining completes. This is not an opinion. It is visible in every public benchmark leaderboard, every production deployment report, and every paper released in the last six months.

This month alone three independent research teams dropped techniques that each deliver larger gains than the last three base model architecture revisions combined. None of them change a single line of the base transformer. All operate on the model after pretraining finishes.

SAERL: Stop guessing about good training data

Up until this paper, every RL post-training pipeline selected data using external signals. Human ratings. Reward model scores. Pass/fail execution checks. No one ever asked the model itself what it actually learned from a given sample.

SAERL changes this. It uses sparse autoencoders to pull three signals directly from the model's internal activations:

Did this sample activate features the model does not already have?
Did this sample sit right at the difficulty boundary of what the model can currently learn?
Did this sample encode consistent, correct reasoning patterns?

No external labels are required. No reward model. No human raters.
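To make the first of those signals concrete, here is a minimal sketch of what "novel features" could mean in practice. It is not the SAERL implementation. It assumes a standard ReLU sparse-autoencoder encoder and a hypothetical novelty score (the fraction of a sample's active features that no earlier sample has triggered). The names sae_encode, active_features and novelty_score, and the toy weights, are all made up for illustration.

```python
from typing import List, Set

Vector = List[float]
Matrix = List[List[float]]


def sae_encode(activation: Vector, w_enc: Matrix, b_enc: Vector) -> Vector:
    """Standard SAE encoder: ReLU(W x + b), one output per dictionary feature."""
    return [
        max(0.0, sum(w * x for w, x in zip(row, activation)) + b)
        for row, b in zip(w_enc, b_enc)
    ]


def active_features(features: Vector, threshold: float = 0.0) -> Set[int]:
    """Indices of features that fire above the threshold."""
    return {i for i, f in enumerate(features) if f > threshold}


def novelty_score(sample_features: Set[int], seen: Set[int]) -> float:
    """Hypothetical signal: share of this sample's active features not seen before."""
    if not sample_features:
        return 0.0
    return len(sample_features - seen) / len(sample_features)


# Toy encoder: 3 dictionary features over a 2-dimensional activation (made-up weights).
W_ENC = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
B_ENC = [0.0, 0.0, 0.0]

seen_so_far = {0}  # pretend feature 0 already fired on earlier training data
sample_a = active_features(sae_encode([0.5, 2.0], W_ENC, B_ENC))   # fires features 0 and 1
sample_b = active_features(sae_encode([-1.0, 0.0], W_ENC, B_ENC))  # fires feature 2 only

print(novelty_score(sample_a, seen_so_far))  # 0.5
print(novelty_score(sample_b, seen_so_far))  # 1.0
```

A curation loop built on this idea would rank candidate samples by such scores and keep the top slice. How the paper defines and combines its three signals is not shown here.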

SAERL improves average accuracy by 3.00% over vanilla GRPO and reaches target accuracy 20% faster. Most critically, the SAE probe transfers across model scales: train it once on a 1.5B model and you can use it to curate training data for a 70B model from the same family.

This is the first production-ready result to come out of mechanistic interpretability. For five years everyone joked that interpretability was just for academics writing papers. Now it is delivering measurable production gains. That is the quietest big news this year.

Guiding LLM Post-training Data Engineering with Model Internals from Sparse Autoencoders

BASIS: Killing the critic cuts post-training cost in half

All LLM reinforcement learning had an accepted terrible tradeoff. You either ran 4-8 rollouts per prompt to get clean advantage estimates, or you ran 1 rollout and wasted half your gradient updates on noise.

Every production team ran 4 rollouts. Relative to a single rollout per prompt, that is 4x the generation compute, so 75% of GRPO rollout compute was spent just to buy cleaner value estimates.

BASIS fixes this with an observation so obvious it is shocking no one published it earlier. You do not need multiple rollouts for the same prompt. You can share information across all prompts in the same training batch.

This single change reduces value-estimation MSE by 69% compared to single-rollout REINFORCE++. Most importantly, one-rollout BASIS produces better value estimates than group-mean estimators running 8 rollouts.
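The contrast between per-prompt groups and batch-level sharing is easier to see in code. The sketch below is not the BASIS estimator, which is not reproduced here. It puts a GRPO-style group-relative advantage (several rollouts of one prompt) next to a generic leave-one-out batch baseline (one rollout per prompt, with the baseline taken from the other prompts in the batch). The leave-one-out baseline is a deliberately crude stand-in that ignores per-prompt difficulty. Function names and reward values are assumptions.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(group_rewards: List[float], eps: float = 1e-8) -> List[float]:
    """GRPO-style: centre and scale rewards within one prompt's group of rollouts.

    Population standard deviation is used here; implementations differ on this detail.
    """
    mu = mean(group_rewards)
    sigma = pstdev(group_rewards)
    return [(r - mu) / (sigma + eps) for r in group_rewards]


def leave_one_out_batch_advantages(batch_rewards: List[float]) -> List[float]:
    """Single rollout per prompt: baseline for prompt i is the mean reward of the others.

    Generic stand-in for 'sharing information across the batch', not the paper's method.
    """
    n = len(batch_rewards)
    if n < 2:
        raise ValueError("need at least two prompts in the batch")
    total = sum(batch_rewards)
    return [r - (total - r) / (n - 1) for r in batch_rewards]


# Four rollouts of ONE prompt (the multi-rollout setup described above).
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))

# One rollout each for FOUR different prompts in the same batch.
print(leave_one_out_batch_advantages([1.0, 0.0, 1.0, 0.0]))
```

Both calls score four rewards, but the second needs a quarter of the rollouts per prompt. Whatever BASIS does to get lower-variance estimates out of that single-rollout setting is where the claimed gain would come from.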

That is not an incremental improvement. That is a step change. You can cut post-training compute by 75% and get better end results. This paper dropped two weeks ago. Every major LLM shop has already ripped out their critic implementations. No one has announced this publicly yet.

BASIS: Batchwise Advantage Estimation from Single-Rollout Information Sharing for LLM Reasoning

PIPO: Reasoning speed without dropping accuracy

Once you have trained a good reasoning model, you still have to run it. Chain of thought means you generate 20-100 working tokens for every answer token returned to the user. Autoregressive decoding accounts for over 90% of production LLM operating cost.
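For a sense of scale, the 20-100 ratio above translates directly into the share of generated tokens the user never sees. This is plain arithmetic, not a measurement. The helper name is made up.

```python
def reasoning_token_share(working_tokens_per_answer_token: float) -> float:
    """Fraction of all generated tokens that are chain-of-thought working tokens."""
    w = working_tokens_per_answer_token
    return w / (w + 1.0)


for ratio in (20, 100):
    # 20 -> about 95.2%, 100 -> about 99.0%
    print(f"{ratio} working tokens per answer token: {reasoning_token_share(ratio):.1%} of output is reasoning")
```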

Before PIPO there were two separate, unconnected hacks for this problem. You could compress input tokens on the way in. You could guess multiple output tokens on the way out. Both required expensive verification passes that ate most of the promised speed gain.

PIPO treats input compression and multi-token prediction as mirror operations. The compressor folds two input tokens into one latent representation. The prediction head unfolds one hidden state into two output tokens. A tiny confidence head, 0.1% of total model parameters, is trained alongside on-policy distillation to accept or reject draft tokens. No separate verifier run is required.
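The fold/unfold symmetry is mostly a statement about shapes, so a toy version helps. The sketch below is not PIPO's architecture. It assumes a mean over each token pair in place of the learned compressor, two linear heads in place of the prediction head, and a sigmoid confidence gate that decides whether the second (draft) token is kept. All weights, names and the 0.5 threshold are hypothetical.

```python
import math
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def fold_pairs(embeddings: List[Vector]) -> List[Vector]:
    """Fold each consecutive pair of token embeddings into one latent (mean as a stand-in)."""
    if len(embeddings) % 2 != 0:
        raise ValueError("toy sketch expects an even number of tokens")
    return [
        [(a + b) / 2.0 for a, b in zip(embeddings[i], embeddings[i + 1])]
        for i in range(0, len(embeddings), 2)
    ]


def matvec(matrix: Matrix, vector: Vector) -> Vector:
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def argmax(values: Vector) -> int:
    return max(range(len(values)), key=values.__getitem__)


def unfold_step(hidden: Vector, head_first: Matrix, head_second: Matrix,
                conf_weights: Vector, threshold: float = 0.5) -> List[int]:
    """Unfold one hidden state into up to two token ids.

    The first token is always emitted; the second is a draft that is kept only
    if the confidence gate clears the threshold (an assumed acceptance rule).
    """
    first = argmax(matvec(head_first, hidden))
    draft = argmax(matvec(head_second, hidden))
    logit = sum(w * x for w, x in zip(conf_weights, hidden))
    confidence = 1.0 / (1.0 + math.exp(-logit))
    return [first, draft] if confidence >= threshold else [first]


# Four input tokens fold into two latents: the sequence length halves on the way in.
latents = fold_pairs([[1.0, 0.0], [3.0, 2.0], [0.0, 0.0], [2.0, 2.0]])
print(latents)  # [[2.0, 1.0], [1.0, 1.0]]

HEAD_FIRST = [[1.0, 0.0], [0.0, 1.0]]   # made-up 2-token vocabulary
HEAD_SECOND = [[0.0, 1.0], [1.0, 0.0]]
print(unfold_step(latents[0], HEAD_FIRST, HEAD_SECOND, conf_weights=[1.0, 0.0]))   # [0, 1]
print(unfold_step(latents[0], HEAD_FIRST, HEAD_SECOND, conf_weights=[-1.0, 0.0]))  # [0]
```

The point of the gate is that acceptance is decided from the same hidden state that produced the draft, so no second forward pass is needed to verify it.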

Results are consistent across every tested benchmark: up to a 7.15-point gain on pass@4, 2.64x faster first-token latency, and 2.07x faster per-token throughput.

Read that again. This technique makes the model twice as fast and improves reasoning accuracy. That should not be possible. It works.

Pair-In, Pair-Out: Latent Multi-Token Prediction for Efficient LLMs

What the IMO silver medal actually tells us

Everyone posted memes about DeepMind winning silver at the International Mathematical Olympiad. Almost no one noticed what they did not do.

They did not train a new 1T-parameter base model. They did not invent a new transformer architecture. They took an off-the-shelf, general-purpose base LLM, then ran three months of extremely good reinforcement learning post-training on formal math.

100% of the gain that got them to silver medal level came after pretraining.

This was not a base model breakthrough. This was an existence proof. The entire field just got confirmation that you can take a good general base model and, with sufficiently good post-training, push its reasoning capability two full standard deviations above the base level.

That is the actual result. Not the medal count. Not the press release. We now know how much headroom exists in every base model currently running. Almost none of it is being used.

The new standard production stack

As of May 2026 this is the stack that every competent LLM team is migrating to right now:

Pretrain a good base model once. You will almost never retrain this.
Train one SAE probe once for that model family. Use it to curate all post-training data permanently.
Run post-training with BASIS instead of GRPO. 75% less compute, same or better results.
Distill the final model with PIPO. Double inference speed, gain reasoning accuracy.

This entire stack did not exist six months ago. Combined, it delivers approximately 10 percentage points higher reasoning accuracy at one-fifth the total cost of the standard stack from January 2026.

We are not going to get another order of magnitude gain from transformer architecture changes. We are going to get the next order of magnitude by no longer treating models as black boxes during post-training. That is the shift that is happening right now, and almost no one is watching.
