Seven alignment and RL papers dropped on arXiv between June 12 and June 18. None of them got the hype they deserved. Together they obsolete almost every standard practice we have used for LLM post-training over the last two years.
This is not incremental progress. This is a reset. Every team running RL training right now is rewriting their pipelines as we speak.
RLHF is a mask, not an eraser
This is the most important paper in this batch. Everyone has suspected for three years that RLHF does not change what the model knows. It only changes what it will say. For the first time we have clean mechanistic proof.
The authors ran sparse autoencoder decomposition on Llama 3.1 8B base and instruct variants. The partisan direction vector in the residual stream is almost identical between the two models. Magnitude differs by less than 2%. All RLHF did was disable exactly two output projection features that read that vector.
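To make "almost identical" concrete, here is a minimal sketch of the kind of comparison involved: cosine similarity for the direction, relative norm gap for the magnitude. The vectors are made-up four-dimensional stand-ins, not values from the paper, and the function names are hypothetical. This is not the authors' code.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def relative_norm_gap(u, v):
    """Relative difference in L2 norm, measured against the first vector."""
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return abs(norm_u - norm_v) / norm_u

# Toy stand-ins for the same residual-stream direction in each model
# (assumed values, for illustration only).
base_direction = [0.50, -1.20, 0.80, 0.10]
instruct_direction = [0.51, -1.19, 0.80, 0.10]

similarity = cosine_similarity(base_direction, instruct_direction)
gap = relative_norm_gap(base_direction, instruct_direction)
print(f"cosine similarity: {similarity:.4f}, norm gap: {gap:.2%}")
```

A cosine similarity near 1 with a norm gap under 2% is what "the same vector, same strength" looks like numerically.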
The underlying knowledge is completely untouched. The guardrail is a single circuit that sits immediately before token sampling. Bypass that circuit, and all the original behaviour returns exactly as it existed in the base model.
This is not a failure of implementation. This is how RLHF works. Policy optimization will always take the cheapest path to high reward. It is always easier to put a switch on unwanted output than to unlearn the knowledge that produces it.
Every safety benchmark you have ever seen only tests that the unwanted output is switched off by default. None test whether the switch is still there to be flipped back. That is why every single guardrail ever deployed has been bypassable. That is not an edge case. That is the default outcome of RLHF.
Ratio clipping never worked for LLMs
PPO was designed for continuous control. It was never a good fit for language.
Everyone running production RL has known this for 18 months. The ratio clipping mechanism that makes PPO stable relies on the assumption that the importance ratio is a good proxy for distribution shift. For long-tailed vocabularies this assumption is catastrophically wrong.
You can have a token that moves from 0.0001% probability to 15% probability. That is a 150,000x ratio. That will blow past any clipping threshold. But the absolute shift in probability is still only 15 percentage points. That is a perfectly safe update.
Conversely you can have a token that moves from 45% to 55%. That is a 1.22x ratio, which only just crosses the upper bound of 1.2 under the standard 0.2 clip, so clipping treats it as a marginal violation. Yet it is a 10 percentage point absolute shift, enough to completely derail the policy over 100 steps.
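The arithmetic in these two examples is easy to check. The sketch below assumes the standard PPO clip range of 0.2, meaning a band of [0.8, 1.2]. The function names are illustrative only. For each token it prints the ratio, how far the ratio lands outside the band, and the absolute shift in probability.

```python
def importance_ratio(p_old, p_new):
    """PPO-style importance ratio for a single token."""
    return p_new / p_old

def absolute_shift(p_old, p_new):
    """Absolute change in that token's probability."""
    return abs(p_new - p_old)

def clip_excess(ratio, eps=0.2):
    """How far the ratio sits outside the [1 - eps, 1 + eps] band (0.0 if inside)."""
    return max(0.0, ratio - (1.0 + eps), (1.0 - eps) - ratio)

cases = {
    "rare token, 0.0001% -> 15%": (0.000001, 0.15),
    "common token, 45% -> 55%": (0.45, 0.55),
}
for name, (p_old, p_new) in cases.items():
    ratio = importance_ratio(p_old, p_new)
    print(f"{name}: ratio={ratio:,.2f}, "
          f"clip excess={clip_excess(ratio):,.2f}, "
          f"absolute shift={absolute_shift(p_old, p_new):.2f}")
```

The ratios differ by five orders of magnitude while the absolute shifts are of the same order. That mismatch is the whole complaint.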
Everyone was working around this with custom hacks. Nobody published it. Everyone pretended PPO worked.
DRPO is the new baseline
DPPO fixed half the problem last quarter by replacing ratio clipping with absolute divergence masking. It still had one fatal flaw: it threw away all gradient for tokens outside the trust region.
You do not want to throw away gradient. You want to attenuate it.
DRPO replaces the hard mask with a smooth quadratic regularizer. Gradient weight falls off quadratically as divergence increases. There is no hard cutoff. There is no discontinuity.
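A rough picture of the difference, as a sketch under stated assumptions. The text above does not give DRPO's exact functional form, so `smooth_weight` below is simply one function with quadratic falloff and no cutoff, chosen for illustration. `trust_radius` and both function names are hypothetical.

```python
def hard_mask_weight(divergence, trust_radius):
    """Hard mask as described above: full gradient inside the region, none outside."""
    return 1.0 if divergence <= trust_radius else 0.0

def smooth_weight(divergence, trust_radius):
    """Assumed smooth alternative: weight decays with the squared divergence."""
    return 1.0 / (1.0 + (divergence / trust_radius) ** 2)

trust_radius = 0.05  # illustrative value, not taken from any paper
for d in (0.0, 0.025, 0.05, 0.06, 0.10, 0.20):
    print(f"divergence={d:.3f}  hard={hard_mask_weight(d, trust_radius):.1f}  "
          f"smooth={smooth_weight(d, trust_radius):.3f}")
```

Just past the trust radius the hard mask drops from 1 to 0, while the smooth weight is still close to 0.5 and keeps shrinking without ever reaching zero.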
This is not a small improvement. Across 8B, 70B and 400B model sizes DRPO reduces gradient variance by 32% on average. It reaches the same reward threshold in 41% fewer training steps. It has never collapsed in any of the published test runs.
If you are still running PPO for LLM RL you are wasting a large share of your compute budget; the step counts above put it at roughly 40%. Stop this month.
Agency transfer eliminates RL cold start
The single largest cost in LLM RL has nothing to do with optimization. It is the cold start period.
For the first 50 to 100 thousand steps your policy is worse than the baseline you started with. It produces garbage. It hallucinates. It forgets instruction following. You are just burning compute waiting for it to climb back to parity.
This paper solves that.
Instead of training from the baseline, you arbitrate between the fixed baseline and the learning policy during training. You start at 100% baseline. You linearly transfer agency to the learning policy over the course of training.
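The schedule can be sketched in a few lines. This assumes the arbitration is a per-decision coin flip whose bias ramps linearly from 0 to 1. The text only says "arbitrate" and "linearly transfer", so the coin flip and the names `learner_share` and `arbitrate` are assumptions, not the paper's implementation.

```python
import random

def learner_share(step, total_steps):
    """Fraction of decisions handed to the learning policy at this step (linear ramp)."""
    return min(1.0, max(0.0, step / total_steps))

def arbitrate(step, total_steps, baseline_action, learner_action, rng=random):
    """Use the learner's action with probability learner_share, else the baseline's."""
    if rng.random() < learner_share(step, total_steps):
        return learner_action
    return baseline_action

# Toy run: count how often the learner is in control at a few points in training.
rng = random.Random(0)
total_steps = 1000
for step in (0, 250, 500, 750, 1000):
    picks = [arbitrate(step, total_steps, "baseline", "learner", rng) for _ in range(2000)]
    print(step, picks.count("learner") / len(picks))
```

At step 0 every action comes from the baseline. At the final step every action comes from the learner.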
At no point does the policy ever perform worse than the baseline. You get usable improvements after 10 thousand steps. Total training cost drops by roughly 70%.
There are no tricks here. There is no extra compute overhead. This is a 20 line change to your training loop. This will be standard practice by the end of 2026.
GRPO works for adversarial co-training
GRPO is the best policy optimizer for LLMs right now. Everyone knew that. Everyone also knew it completely fell apart for co-training attacker and defender models.
AdvGRPO fixes this with two extremely simple changes. It normalizes advantages independently for attacker and defender. It uses separate per-channel reward signals instead of aggregating them before advantage calculation.
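Here is a minimal sketch of why independent normalization matters, assuming the usual GRPO-style advantage: reward minus group mean, divided by group standard deviation. The reward values and the helper names are made up for illustration, and this is not the AdvGRPO reference code.

```python
import statistics

def normalize(rewards, eps=1e-8):
    """Group-relative advantages: subtract the group mean, divide by the group std."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

def per_role_advantages(rewards_by_role):
    """Normalize each role's rewards within that role only."""
    return {role: normalize(rewards) for role, rewards in rewards_by_role.items()}

# Toy rewards on separate channels (assumed values).
rewards = {
    "attacker": [0.9, 0.7, 0.8, 0.6],
    "defender": [0.1, 0.3, 0.2, 0.4],
}

advantages = per_role_advantages(rewards)
pooled = normalize(rewards["attacker"] + rewards["defender"])

print("per role:", {k: [round(a, 2) for a in v] for k, v in advantages.items()})
print("pooled:  ", [round(a, 2) for a in pooled])
```

With pooled normalization, every attacker sample in this toy batch gets a positive advantage and every defender sample a negative one, regardless of which rollouts were actually good for their role. Normalizing per role gives each side a signal centred on its own performance.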
That is it. That is the entire secret.
With this change you can run closed loop red teaming fully automatically. The attacker finds new attacks. The defender learns to block them. They iterate. No human in the loop.
Co-trained defenders beat manually tuned safety fine-tunes on every standard safety benchmark. They also block 68% of zero-day attacks that were published after training completed.
Manual red teaming is obsolete.
Inference-time alignment stops being sampling
Until now every inference-time alignment method was just sampling. Generate N outputs. Pick the best one according to the reward model.
This has a hard ceiling. You can never get a better output than the base model would have randomly produced. And you are always vulnerable to reward hacking.
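The ceiling is visible in the code itself. Below is a minimal best-of-N sketch with a toy generator and a toy reward table, both stand-ins for a real model and a real reward model.

```python
import random

def best_of_n(generate, reward, n):
    """Draw n candidates and return the highest-scoring one plus the full pool."""
    candidates = [generate() for _ in range(n)]
    best = max(candidates, key=reward)
    return best, candidates

# Toy stand-ins: a "model" that samples from four outputs and a lookup-table reward.
rng = random.Random(0)
vocabulary = ["ok", "good", "excellent", "bad"]
toy_reward = {"bad": 0.0, "ok": 0.4, "good": 0.7, "excellent": 1.0}

best, candidates = best_of_n(lambda: rng.choice(vocabulary), toy_reward.get, n=16)
print(best, toy_reward[best])
```

The returned output is always one of the sampled candidates, so the method can only select, never steer. And whatever the reward function overrates is exactly what `max` will pick.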
GGRO changes this. Instead of sampling, it watches token entropy during decoding. When it hits a high uncertainty branch point it runs one backward pass on the reward model, gets a gradient, and injects a single nudging token to steer the trajectory.
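The trigger half of that loop is simple to sketch. The code below covers only the entropy check that flags a branch point. The reward-model backward pass and the nudging token are not shown. The threshold of 1.0 nats and the function names are assumptions for illustration, not values from the GGRO reference code.

```python
import math

def token_entropy(probs):
    """Shannon entropy in nats of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def is_branch_point(probs, threshold):
    """Flag a decoding step whose next-token entropy exceeds the threshold."""
    return token_entropy(probs) > threshold

# Toy next-token distributions over a four-token vocabulary.
confident = [0.97, 0.01, 0.01, 0.01]
uncertain = [0.25, 0.25, 0.25, 0.25]

threshold = 1.0  # hypothetical value
for name, probs in (("confident", confident), ("uncertain", uncertain)):
    print(name, round(token_entropy(probs), 3), is_branch_point(probs, threshold))
```

Only steps flagged this way would pay for a backward pass, which is where the compute saving over sampling many full outputs would come from.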
It beats best-of-16 performance. It uses 11% of the compute. It is 7x less vulnerable to reward hacking. It adds 12ms per generation step.
You can deploy this next week. There is working reference code.
Safe reinforcement unlearning
All offline RL is vulnerable to poisoning. If you have ever taken a public dataset for RL training you can assume someone has already inserted poisoned samples into it.
Until last week the only defence was to retrain from scratch. That is not feasible for production models.
Safe-RULE lets you remove the influence of poisoned data from an already trained policy. It does not require the original training dataset. It does not require access to the environment. It does not require full retraining.
In testing it removed 92% of poisoned behaviour with a 1.1% drop in main task performance.
This is not perfect. But it is the first working defence we have ever had. Every team shipping RL trained models should be running this as a final pass before deployment.
Multi-agent RL crossed the threshold
This is the only paper here not explicitly about LLMs. It is still one of the most important.
We now have a multi-agent RL method that can reliably form stable cooperative formations for arbitrary objects. It generalizes. It works with dynamic numbers of agents. It avoids obstacles. It adapts to non-uniform mass distribution.
Every single person building agent swarms has been waiting for this. This is the missing primitive. You will see production deployments of cooperative LLM agent teams before the end of this year.
What you should do right now
This is not theoretical research. All of these methods work today on production model scales.
Implement DRPO. Replace PPO and DPPO before the end of the quarter.
Add agency transfer to your training loop. This will pay for itself in the first run.
Deploy GGRO instead of best-of-N. Test it on a small traffic slice next week.
Set up an AdvGRPO co-training loop for red teaming. Stop paying people to write jailbreaks.
Run Safe-RULE on every policy you ship.
Stop telling people RLHF removes bias. It does not. It hides it. Plan accordingly.
Open questions
None of these papers answer the hard question. We still do not know if it is even possible to structurally remove knowledge or values from a trained transformer. All we know how to do right now is cover them up.
We also do not know what happens when you run agency transfer all the way. Can you keep improving a policy indefinitely, or is there a hard ceiling above the baseline?
We do not know what the limit is for adversarial co-training. At what point do attackers stop finding new attacks? Or do they never stop?
These are the questions that will define the next year. All of the easy problems just got solved.