Every team running LLMs at scale is fighting the same four battles right now. KV cache explodes when models run chain of thought. Bad outputs stick around after fine tuning. Autoregressive decoding wastes 70% of GPU cycles. 4 bit quantization still breaks reasoning performance.
This week four papers landed on arXiv, and each one goes directly at one of these problems. None require full model retraining. All report reproducible benchmark numbers. This is not a list of theoretical ideas: every one of these can be deployed on your serving cluster within two weeks.
The KV cache problem no one was talking about
Everyone knows KV cache is the single largest memory consumer during LLM inference. Almost no one talks about what happens when you run reasoning models.
For standard chat, KV cache grows linearly with context length. Chain of thought reasoning has the same cost per token, but the number of tokens balloons, because every intermediate reasoning step stays in the cache. A 14B model running AIME problems will generate 1200+ intermediate reasoning tokens before outputting the final answer. By the end of that trajectory 88% of your GPU memory is holding KV cache. You can fit less than one concurrent request per A100.
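To see where the memory goes, here is a back of the envelope sketch of KV cache size under standard attention. The layer count, KV head count, and head dimension are hypothetical placeholders, not the published configuration of any particular 14B model, and fp16 storage is assumed.

```python
def kv_cache_bytes(num_tokens, num_layers=48, num_kv_heads=8,
                   head_dim=128, bytes_per_value=2):
    """Bytes of KV cache held for one sequence of num_tokens tokens.

    All default dimensions are illustrative assumptions.
    """
    # Factor 2: one key vector and one value vector per token, head, and layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


per_token = kv_cache_bytes(1)
trajectory = kv_cache_bytes(1200)   # one 1200 token reasoning trace
batch = 64 * trajectory             # 64 concurrent traces

print(f"{per_token / 1024:.0f} KiB per token")
print(f"{trajectory / 2**20:.0f} MiB per 1200 token trace")
print(f"{batch / 2**30:.1f} GiB for 64 concurrent traces")
```

Under these assumptions the cost per token is fixed, so total memory is set by how many tokens each request keeps and how many requests run at once. That product is what a cache budget caps.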
Existing eviction policies all make the same mistake: they allocate exactly the same cache budget to every layer and every attention head. That was a reasonable default for chat models. It is catastrophically bad for reasoning.
ReasonAlloc: cache budgets follow the reasoning wave
ReasonAlloc fixes this. It is a training free plug in that sits on top of any existing KV eviction policy. It changes exactly one thing: it does not distribute cache budget uniformly.
The authors observed something very simple that everyone had missed. In reasoning trajectories, attention utility moves down the layers as decoding proceeds. The first 100 steps use almost exclusively upper layers. Steps 100-400 use middle layers. Only at the end of the trajectory do lower layers get meaningful attention hits. The authors call this pattern the Reasoning Wave.
ReasonAlloc runs a 10 minute offline calibration pass once per model architecture. It produces a static layer budget curve that allocates 3x more cache to layers that will actually use it during reasoning. During decoding it reallocates remaining head level budget every 16 steps based on measured attention entropy.
Total overhead added per decoding step: 0.12%.
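The two level allocation described above can be sketched roughly as follows. This is not the authors' code: the layer curve values, the 0.05 floor, and the choice to give more slots to high entropy heads are all assumptions made for illustration, and `split_budget` and `attention_entropy` are hypothetical helper names.

```python
import numpy as np


def split_budget(weights, total):
    """Split an integer budget in proportion to weights (largest remainder)."""
    w = np.asarray(weights, dtype=np.float64)
    raw = w / w.sum() * total
    base = np.floor(raw).astype(int)
    leftover = int(total - base.sum())
    # Give the leftover slots to the entries with the largest fractional part.
    order = np.argsort(-(raw - base), kind="stable")
    base[order[:leftover]] += 1
    return base


def attention_entropy(probs):
    """Shannon entropy of each head's attention distribution (last axis)."""
    p = np.asarray(probs, dtype=np.float64)
    safe = np.where(p > 0, p, 1.0)  # log(1) = 0, so zero entries add nothing
    return -np.sum(p * np.log(safe), axis=-1)


# Hypothetical calibration output: two layers weighted 3x (illustrative only).
layer_curve = [1.0, 1.0, 3.0, 3.0]
layer_budgets = split_budget(layer_curve, total=512)

# Every 16 steps, re-split one layer's budget across its heads.
head_probs = np.array([
    [0.97, 0.01, 0.01, 0.01],  # sharply peaked head
    [0.25, 0.25, 0.25, 0.25],  # diffuse head
])
# Assumption: diffuse (high entropy) heads get more slots; 0.05 is a floor.
head_weights = attention_entropy(head_probs) + 0.05
head_budgets = split_budget(head_weights, total=int(layer_budgets[2]))
print(layer_budgets, head_budgets)
```

The largest remainder split keeps the total exactly equal to the budget, so the underlying eviction policy still sees the same overall cache size and only the distribution changes.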
On DeepSeek R1 14B with 512 token cache budget, ReasonAlloc hits 91% of full cache accuracy on MATH 500. Uniform budget allocation hits 72%. Pyramid RKV hits 79%. At 256 token budget the gap is even larger. You can run twice as many concurrent reasoning requests on the same GPU while keeping accuracy close to the full cache baseline.
You can drop this into vLLM tonight. There is no catch.
LLM unlearning does not have to break your model
Every production team has run into this. You fine tune a model. Then you find it outputs bad answers for 12 specific edge cases. You cannot retrain the whole model. You cannot just add system prompts, because users will work around them.
All existing unlearning methods break something. They will remove the bad answer, but they will also degrade general performance 5-15% across the board. Everyone accepts this as an unavoidable tradeoff.
It is not unavoidable.
NSRU: unlearn only what you want to unlearn
NSRU is a LoRA based unlearning method that does exactly what it says on the tin. It will remove the exact behaviour you specify. It will not touch anything else.
The core insight is extremely clean. For every module in the model, there exists a subspace of activations responsible for all normal good behaviour. Any update inside this subspace will break general performance. Any update orthogonal to this subspace will not.
NSRU first estimates this retain subspace on 1000 normal benign prompts. It then constrains all LoRA updates to lie strictly in the null space of the retain subspace, meaning its orthogonal complement. You can train the unlearning objective as normal, but the update cannot change how the model responds to anything inside the estimated retain subspace.
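One common way to realise a null space constraint for a single linear layer is to project the LoRA down matrix onto the orthogonal complement of the retain activations. The sketch below does that with synthetic data. It is an assumption about the mechanics, not the paper's implementation, and the dimensions, the rank, and the helper name `null_space_projector` are hypothetical.

```python
import numpy as np


def null_space_projector(retain_acts, rank):
    """Projector onto the orthogonal complement of the top-`rank` directions
    of the retain activations (rows are samples, columns are input features)."""
    _, _, vt = np.linalg.svd(retain_acts, full_matrices=False)
    basis = vt[:rank].T                      # (d_in, rank), orthonormal columns
    return np.eye(retain_acts.shape[1]) - basis @ basis.T


rng = np.random.default_rng(0)
d_in, d_out, lora_rank = 16, 8, 2

# Synthetic "retain" activations that live in a 4 dimensional subspace.
directions = rng.normal(size=(4, d_in))
retain = rng.normal(size=(1000, 4)) @ directions

P = null_space_projector(retain, rank=4)

# LoRA update delta_W = B @ A, with A projected so it ignores retain inputs.
A = rng.normal(size=(lora_rank, d_in))
B = rng.normal(size=(d_out, lora_rank))
delta = B @ (A @ P)

retain_shift = np.abs(retain @ delta.T).max()   # effect on retain inputs
probe = rng.normal(size=d_in)                   # an input outside the subspace
other_shift = np.abs(delta @ probe).max()
print(retain_shift, other_shift)
```

In this toy setup the projected update leaves retain inputs unchanged up to floating point error while still moving other inputs. With real activations the retain subspace is only estimated, so the guarantee is only as good as that estimate.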
On the TOFU benchmark, NSRU reduces forget set recall from 92% to 7%, while improving retain set QA performance by 1.2%. Every other baseline reduced retain performance by between 4% and 11%. On the WMDP hazardous knowledge benchmark, NSRU drops hazardous accuracy to 25%, which is exactly random chance, while preserving 98% of MMLU score.
This is the first unlearning method that you can actually use in production. No one will notice you ran it.
Multi token inference was broken for three years
Everyone has known for three years that autoregressive decoding is the fundamental bottleneck. Everyone also knows every existing multi token prediction method produces garbage repetitive outputs. Everyone assumed this was a fundamental tradeoff.
It was not a tradeoff. It was a bug.
All prior multi token implementations ran a separate prediction head for every future token, including the very next token. That head competed directly with the main model LM head. They were fighting. One of them would always win. When the multi token head won you got repeated garbage. When the main head won you got no speedup.
No one noticed this for three years.
CLP: zero loss multi token acceleration
CLP fixes this with one rule. The main LM head always generates the first token. Always. Multi token heads only predict tokens after that one.
That is the entire core insight. Everything else follows.
On top of this they added a single linear layer that predicts how many safe tokens can be emitted after the first one. This layer has 7,712 parameters for a 7B model. That is 0.0001% of the total model size. It replaces the 1.2 million parameter gate networks everyone else was using.
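The decoding rule can be sketched in a few lines. This is a minimal illustration with greedy decoding and random weights, not the authors' implementation: treating the length layer as a linear scorer over candidate counts is an assumption, and `clp_step` and all shapes are hypothetical.

```python
import numpy as np


def clp_step(hidden, main_head, extra_heads, length_layer):
    """Emit 1 + n tokens from one forward pass (greedy, illustrative).

    hidden:       (d,) final hidden state
    main_head:    (vocab, d) the model's ordinary LM head
    extra_heads:  list of (vocab, d) heads for the following positions
    length_layer: (len(extra_heads) + 1, d) linear scorer over extra token counts
    """
    # Rule from the text: the first token always comes from the main LM head.
    tokens = [int(np.argmax(main_head @ hidden))]
    # Assumed form: pick the count n in 0..k with the highest linear score.
    n_extra = int(np.argmax(length_layer @ hidden))
    for head in extra_heads[:n_extra]:
        tokens.append(int(np.argmax(head @ hidden)))
    return tokens


rng = np.random.default_rng(0)
d, vocab, k = 8, 50, 3
hidden = rng.normal(size=d)
main_head = rng.normal(size=(vocab, d))
extra_heads = [rng.normal(size=(vocab, d)) for _ in range(k)]
length_layer = rng.normal(size=(k + 1, d))

tokens = clp_step(hidden, main_head, extra_heads, length_layer)

# A length layer that always votes for zero extra tokens falls back to
# plain autoregressive decoding: exactly one token from the main head.
cautious = np.zeros((k + 1, d))
cautious[0] = hidden
fallback = clp_step(hidden, main_head, extra_heads, cautious)
print(tokens, fallback)
```

Because no extra head ever predicts the next token, the first token of every step matches what vanilla greedy decoding would emit in this sketch. The length layer only decides how far past it to go.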
On Qwen2.5 7B, CLP delivers 1.18x end to end speedup across all benchmarks. Repetition ratio is 0.018%, identical to vanilla autoregressive decoding. All prior methods either delivered <1.07x speedup or had repetition ratios above 0.5%. On 1.5B models it hits 1.27x speedup.
This does not require any changes to your base model weights. You can train the CLP head on 100 million tokens in 3 hours on one A100. You can deploy it without breaking any existing API contracts.
Stop using percentile quantization scales
Post training quantization is the single most widely deployed optimization. Almost everyone uses the exact same default scale calculation: take the 99.99th percentile of the weight distribution. This was a reasonable heuristic invented in 2022. No one has bothered to improve it since.
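For reference, the percentile default looks roughly like this. The sketch assumes symmetric round to nearest quantization per channel and takes the percentile over absolute weight values. Real libraries differ in the details, and `percentile_scale` and `fake_quantize` are hypothetical names.

```python
import numpy as np


def percentile_scale(weights, bits=4, pct=99.99):
    """Scale that maps the pct-th percentile of |w| to the largest integer code."""
    qmax = 2 ** (bits - 1) - 1
    return float(np.percentile(np.abs(weights), pct)) / qmax


def fake_quantize(weights, scale, bits=4):
    """Round to the nearest code, clip to the symmetric range, map back to floats."""
    qmax = 2 ** (bits - 1) - 1
    return np.clip(np.round(weights / scale), -qmax, qmax) * scale


rng = np.random.default_rng(0)
channel = rng.normal(scale=0.02, size=4096)   # synthetic weight channel

scale = percentile_scale(channel)
recon = fake_quantize(channel, scale)
l2_error = float(np.sum((channel - recon) ** 2))
print(scale, l2_error)
```

Nothing in this rule looks at the rounding error it produces. The scale is fixed by a single order statistic of the weights.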
It is possible to calculate the exact optimal quantization scale for every weight channel. No one was doing it because everyone assumed it required expensive iterative optimization.
PiSO: exact optimal quantization scales
PiSO proves that the quantization error function is piecewise convex. There are a finite number of intervals where the optimal scale has an exact closed form solution. You do not need gradient descent. You do not need iterations. You can compute the perfect scale for every channel in one pass.
This is not an approximation. This is the mathematically optimal scale that minimizes L2 error after rounding.
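The piecewise structure can be illustrated with a brute force search. The sketch below is not PiSO's algorithm and is much slower than a single pass. It assumes symmetric round to nearest quantization with clipping, enumerates every scale at which some rounding decision flips, and solves the closed form least squares scale inside each interval. The names `exact_scale` and `quant_error` are hypothetical.

```python
import numpy as np


def quant_error(w, s, qmax=7):
    """L2 error after rounding to codes in [-qmax, qmax] with scale s."""
    q = np.clip(np.round(w / s), -qmax, qmax)
    return float(np.sum((w - s * q) ** 2))


def exact_scale(w, qmax=7):
    """Search all rounding intervals for the scale with the lowest L2 error."""
    w = np.asarray(w, dtype=np.float64)
    mags = np.abs(w[w != 0])
    if mags.size == 0:
        return 1.0, 0.0  # all-zero channel: any scale is exact
    # Rounding decisions flip only where |w_i| / s crosses k + 0.5.
    half_steps = np.arange(qmax) + 0.5
    breaks = np.unique(np.outer(mags, 1.0 / half_steps))
    edges = np.concatenate(([0.0], breaks))
    best_s, best_err = None, np.inf
    for lo, hi in zip(edges[:-1], edges[1:]):
        # Codes are constant inside (lo, hi); read them off at the midpoint.
        q = np.clip(np.round(w / (0.5 * (lo + hi))), -qmax, qmax)
        # With q fixed the error is a parabola in s, minimised at <w,q>/<q,q>.
        s = float(np.clip(w @ q / (q @ q), lo, hi))
        err = float(np.sum((w - s * q) ** 2))
        if err < best_err:
            best_s, best_err = s, err
    return best_s, best_err


rng = np.random.default_rng(0)
w = rng.normal(size=64)                      # synthetic weight channel
best_s, best_err = exact_scale(w)
absmax_err = quant_error(w, np.abs(w).max() / 7)
print(best_s, best_err, absmax_err)
```

Within each interval the integer codes do not change, so the error is a simple quadratic in the scale and its minimiser has a closed form. The search only has to compare one candidate per interval, which is the property that makes an exact answer possible without gradient descent.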
For 4 bit weight only quantization on Llama 3 70B, PiSO reduces perplexity by 3.1% over standard percentile scaling. Zero shot average accuracy improves by 2.4%. For 3 bit quantization the gain jumps to a 7.8% perplexity improvement.
This change fits in 120 lines of code. You can patch it into your quantization library this afternoon. There is zero overhead at inference time. No extra calibration data is required beyond what you already use.
Deployment prioritization
Prioritization is straightforward.
- If you run reasoning workloads: deploy ReasonAlloc first. It will double your throughput this week for zero cost.
- If you run general serving: deploy PiSO first. It will give you almost all the benefit of 1000 man hours of quantization research for 2 hours of engineering work.
- If you have ever had to remove behaviour from a deployed model: deploy NSRU. Stop accepting that unlearning breaks things.
- If you are latency bound: deploy CLP. It is the first multi token method that does not come with asterisks.
None of these conflict. You can run all four at the same time on the same model. Combined they will approximately double the throughput of any production LLM deployment running today.
Closing note
This is what actual progress in LLM inference looks like. Not bigger models. Not new architectures. Small, correct, measurable improvements that make the systems we already run twice as good.
None of these papers got viral Twitter threads. None announced a new world record on any leaderboard. All four will be running on every production LLM cluster 12 months from now.
Paper references
- ReasonAlloc: Hierarchical Decoding-Time KV Cache Budget Allocation for Reasoning Models http://arxiv.org/abs/2606.11164v1
- Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning http://arxiv.org/abs/2606.10989v1
- CLP: Collocation-Length Prediction for Zero-Loss Adaptive Multi-Token Inference http://arxiv.org/abs/2606.10935v1
- Optimal Post-Training Quantization Scales and Where to Find Them http://arxiv.org/abs/2606.10890v1