Everything We Knew About RL For LLMs Changed Last Week

Every six months or so, a batch of papers lands that makes every production RLHF pipeline running right now look obsolete. That happened last week. Seven independent works all dropped within 48 hours on arXiv, each attacking a different broken foundation of modern LLM reinforcement learning. None of them are minor tweaks. None of them add another knob to tune. They collectively throw out almost every assumption the field has operated on for the last three years.

Static reward models are dead

For three years we operated on one unchallenged rule: you train a reward model once on human preferences, then you run RL against it. This was always a bad idea. Human preferences are not static. They do not compress into a single 4096-dimensional vector. They shift by domain, by user, by context, by how tired the human was when they clicked the rating button.

In-Context Reward Adaptation kills this approach entirely. Instead of training a fixed reward function, the paper demonstrates that you can use the LLM's own in-context learning capability to infer reward structure on the fly from as few as 8 preference examples.

There is a catch. A vanilla transformer cannot do this reliably. Its inferred reward is asymptotically biased away from the ground-truth preference. The authors found one trivial, almost embarrassing fix that makes it work: include the human response time as an input token. That single additional signal eliminates the bias. It works across completely unseen preference domains with no retraining. No fine-tuning. No new weights. Just prompt the model for a preference judgement with demonstration examples and response times.
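To make the idea concrete, here is a minimal sketch of what such a prompt could look like. The field names, the text layout, and the `build_reward_prompt` helper are all assumptions made for illustration. The paper's actual input format is not described here, and this sketch passes the response time as plain text rather than as a dedicated token.

```python
from dataclasses import dataclass


@dataclass
class PreferenceExample:
    prompt: str
    response_a: str
    response_b: str
    preferred: str          # "A" or "B"
    response_time_s: float  # how long the rater took to decide (assumed unit: seconds)


def build_reward_prompt(examples, prompt, response_a, response_b):
    """Format demonstrations plus one query for an in-context preference judgement."""
    parts = []
    for i, ex in enumerate(examples, start=1):
        parts.append(
            f"Example {i}\n"
            f"Prompt: {ex.prompt}\n"
            f"Response A: {ex.response_a}\n"
            f"Response B: {ex.response_b}\n"
            f"Rater response time: {ex.response_time_s:.1f}s\n"
            f"Preferred: {ex.preferred}\n"
        )
    # The query leaves the label open so the model completes it.
    parts.append(
        "Query\n"
        f"Prompt: {prompt}\n"
        f"Response A: {response_a}\n"
        f"Response B: {response_b}\n"
        "Preferred:"
    )
    return "\n".join(parts)


demos = [
    PreferenceExample("Summarize the memo.", "Short summary.", "Long ramble.", "A", 2.5),
    PreferenceExample("Fix the bug.", "Untested patch.", "Patch with a test.", "B", 11.0),
]
text = build_reward_prompt(demos, "Explain the error.", "It broke.", "The index is out of range.")
```

The returned string would then be sent to whatever model is acting as the judge. In the setup described above, that means eight or more demonstrations rather than the two shown here.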

This is not an improvement. This is a replacement. There is no good reason to train a standalone static reward model ever again.

Rubric rewards work, if you stop cheating

Everyone knew rubric scoring was better than single-number ratings. No one could stop it from being exploited during RL. Models will reliably find every edge case, every wording loophole, every way to get a high rubric score without doing the actual thing you asked.

RLR³ fixes this. The paper does two extremely obvious things that no one bothered to implement correctly before.

First, it classifies every rubric line item as either verifiable or judgement-based. Verifiable criteria get extracted by an LLM, then run through a deterministic checker. The LLM doing extraction never sees the ground truth. Judgement criteria are scored by a separate LLM that never sees the reference answer.

Second, it does not sum scores. It hierarchically aggregates. Hard requirements gate all optional criteria. If you miss one mandatory line item, you get zero for the whole response, no matter how well you did on everything else.
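The gating rule is easy to state in code. What follows is a minimal sketch, not the paper's implementation: the `Criterion` type, the pass threshold of 1.0 for mandatory items, and the weighted mean over optional items are all assumptions chosen to illustrate "hard requirements gate everything else".

```python
from dataclasses import dataclass


@dataclass
class Criterion:
    name: str
    mandatory: bool
    score: float         # in [0, 1], from a deterministic checker or a judge model
    weight: float = 1.0  # assumed positive; only used for optional criteria


def gated_reward(criteria):
    """Return 0.0 if any mandatory criterion fails, else the weighted mean of optional scores."""
    mandatory = [c for c in criteria if c.mandatory]
    optional = [c for c in criteria if not c.mandatory]
    # Gate: a single failed hard requirement zeroes the whole response.
    if any(c.score < 1.0 for c in mandatory):
        return 0.0
    if not optional:
        return 1.0
    total_weight = sum(c.weight for c in optional)
    return sum(c.weight * c.score for c in optional) / total_weight


passing = [
    Criterion("answers the question", True, 1.0),
    Criterion("tone", False, 0.5),
    Criterion("concise", False, 1.0),
]
failing = [Criterion("answers the question", True, 0.0)] + passing[1:]

print(gated_reward(passing))  # 0.75
print(gated_reward(failing))  # 0.0
```

Compare this with a flat sum, where the failing response would still collect 1.5 of 3 points. That leftover credit is exactly what a policy learns to farm.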

On Qwen3-VL-30B this delivered a 4.7-point improvement over standard RLVR across 15 benchmarks. That is the entire gap between the base model and the official instruct fine-tune. Audits found this setup reduced exploitable false-positive rewards by 89%.

We have been running RL with broken reward accounting for years. This paper just did the accounting correctly.

RL does not create representations. It recruits them.

This is the most important paper in the batch, and the one that will break most people's mental model of how alignment works.

We have always assumed that when you run RL on an LLM, you are writing new representations into the model. You are teaching it what good and bad outputs look like. That is wrong.

The authors trained models on a simple neutral maze task, then extracted the concept vectors associated with reward and punishment. These vectors did not just work in the maze. They worked everywhere.

The punishment vector induces refusal, uncertainty, pathological backtracking, negative self-report, and reduced goal achievement across every task you test. The reward vector does the opposite. The two are almost perfectly antiparallel.

Most critically: these vectors exist in the raw pretrained model. Before any RL. Before any maze training. Before any fine tuning at all. RL does not create this axis. It finds it, and hooks the reward signal up to it.
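For readers who have not worked with concept vectors, here is a toy sketch of how "antiparallel" can be checked. It assumes a difference-of-means extraction, which is a common technique but is not confirmed as the authors' method. The two-dimensional "activations" are made-up numbers, not real model states.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length vectors."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def concept_vector(concept_acts, baseline_acts):
    """Difference of means: average concept activation minus average baseline activation."""
    concept_mean = mean_vector(concept_acts)
    baseline_mean = mean_vector(baseline_acts)
    return [c - b for c, b in zip(concept_mean, baseline_mean)]


def cosine(u, v):
    """Cosine similarity; -1.0 means the vectors point in exactly opposite directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


# Made-up activations standing in for hidden states at one layer.
baseline_acts = [[0.0, 0.0], [0.0, 0.0]]
reward_acts = [[1.0, 0.5], [1.0, 0.5]]
punish_acts = [[-1.0, -0.5], [-1.0, -0.5]]

reward_vec = concept_vector(reward_acts, baseline_acts)
punish_vec = concept_vector(punish_acts, baseline_acts)
similarity = cosine(reward_vec, punish_vec)
```

In this toy case the similarity is -1 up to floating-point error by construction. The interesting claim in the paper is that real hidden states from a raw pretrained model land close to that value too.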

This is not a minor observation. This changes every single thing we thought about alignment safety, interpretability, and the long term effects of RL training. All of our alignment work is just flipping switches that were already built into the model during pretraining.

GRPO was a beta. We already have its replacement.

Six months ago everyone was celebrating GRPO as the end of PPO. Everyone migrated their pipelines over. We already knew it had failure modes. No one knew how trivial they were to fix.

Hysteretic Policy Optimization is a three-line change to GRPO. It reduces the weight of negative-advantage updates, and replaces per-response length normalization with batch-mean normalization. That is it.
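Here is one way to read those two changes, written as the per-token weight each response's log-probability gradient would receive. This is a sketch under assumptions, not the paper's code: the `neg_scale` constant of 0.5, the standard group-normalized advantage, and the reading of "batch-mean normalization" as dividing by the mean response length in the batch are all guesses made for illustration. Clipping and importance ratios are left out.

```python
def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: reward standardized within the sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = var ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]


def grpo_token_weights(rewards, lengths):
    """Baseline: each response's advantage is divided by its own length."""
    return [a / n for a, n in zip(group_advantages(rewards), lengths)]


def hpo_token_weights(rewards, lengths, neg_scale=0.5):
    """Sketch of the two changes: damp negative advantages, normalize by batch mean length."""
    mean_len = sum(lengths) / len(lengths)
    weights = []
    for a in group_advantages(rewards):
        scaled = a if a >= 0 else neg_scale * a  # assumed constant down-weighting
        weights.append(scaled / mean_len)        # same divisor for every response
    return weights


# Sparse-reward group: one success out of four samples.
rewards = [1.0, 0.0, 0.0, 0.0]
lengths = [10, 20, 30, 40]
grpo_w = grpo_token_weights(rewards, lengths)
hpo_w = hpo_token_weights(rewards, lengths)
```

Under per-response normalization, a long failed response gets a smaller per-token penalty than a short one. With a shared divisor, every failed response in the group is penalized equally per token, and all of those penalties are damped relative to the lone success.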

On the sparse-reward TeleLogs benchmark it beats GRPO by 15%. It beats SAPO by 5% and GSPO by 11%. All gains come in the early training phase where reward is still sparse, exactly the region where every production RL run gets stuck or collapses. There is no downside. No additional compute. No new hyperparameters that need tuning.

GRPO had a good run. It is obsolete.

Chaos is the default operating regime

Almost all RL theory is built on the assumption that trajectories are approximately stable. Small changes in action produce small changes in outcome.

This is never true for LLMs. Language output is a chaotic system. Exponential sensitivity to initial conditions is not an edge case. It is how the system works at every step.

Standard RL averages over diverging trajectories and produces garbage gradients. Distributional RL does not. The paper proves that while individual trajectories diverge exponentially, the return distribution evolves smoothly under 1-Wasserstein distance. This is not an empirical observation. This is a mathematical result that holds for all chaotic systems meeting very mild stability conditions.
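A toy experiment gives some intuition for the distinction, though it proves nothing. The sketch below uses the logistic map as a stand-in chaotic system and the sum of visited states as a made-up "return". Both choices are assumptions for illustration and have nothing to do with the paper's setup. The 1-Wasserstein distance between two equal-size samples reduces to the mean absolute difference of their sorted values, which is what `wasserstein_1` computes.

```python
def wasserstein_1(xs, ys):
    """1-Wasserstein distance between two equal-size empirical samples."""
    if len(xs) != len(ys):
        raise ValueError("this sketch assumes equal sample sizes")
    return sum(abs(a - b) for a, b in zip(sorted(xs), sorted(ys))) / len(xs)


def logistic_return(x0, steps=50, r=4.0):
    """Sum of states along a logistic-map trajectory, used as a toy 'return'."""
    x, total = x0, 0.0
    for _ in range(steps):
        x = r * x * (1.0 - x)  # chaotic for r = 4.0
        total += x
    return total


# Two ensembles whose initial conditions differ by a tiny perturbation.
n = 1000
starts = [(i + 0.5) / n for i in range(n)]
returns_a = [logistic_return(x0) for x0 in starts]
returns_b = [logistic_return(x0 + 1e-6) for x0 in starts]

# Trajectory-level view: how far apart paired returns end up, on average.
mean_pairwise_gap = sum(abs(a - b) for a, b in zip(returns_a, returns_b)) / n
# Distribution-level view: how far apart the two return distributions are.
distribution_gap = wasserstein_1(returns_a, returns_b)
```

The sorted pairing minimizes the mean absolute gap, so `distribution_gap` can never exceed `mean_pairwise_gap`. Running it shows how much smaller the distribution-level number is when paired trajectories have decorrelated.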

Every policy optimizer we use today is built for a world that does not exist. We have been running optimization algorithms designed for stable systems on chaotic ones. That is why training is so unstable. That is why results vary so wildly between runs.

This is not incremental progress

None of these papers are fighting over 0.2% on MMLU. None of them are proposing another 10x larger model.

They are fixing the foundations. Over the last three years we built a working RLHF stack out of duct tape, empirical hacks, and things that just happened to work for no good reason. We have now reached the point where people are going back, figuring out why things actually work, and replacing every part of the stack one by one.

This shift is happening faster than almost anyone expected. Twelve months ago we were still arguing about whether PPO was necessary. Six months ago GRPO landed. Today we already have its replacement, and we are throwing out the reward model layer entirely.

If you are running a production LLM alignment pipeline today, every single component you are using will be obsolete within six months. That is not a complaint. That is the fastest rate of progress this field has ever seen.
