Six Production LLM Optimizations That Actually Matter Right Now

#llm-inference #lora #kv-cache #transformer-architecture #diffusion-llm

Every week, 30+ LLM papers hit arXiv. 95% of them will never run on a production cluster. This week was different. Six papers landed that solve problems people are fighting right now on deployed workloads. Five of the six need no retraining of a base model; Depth-Attention, which changes the model graph, is the exception. None requires custom hardware. All have working code.

The quiet shift in inference optimization

For three years everyone chased the same wins: better weight quantization, continuous batching, speculative decoding. The low-hanging fruit there is gone. We are now in the second-order optimization phase. Each of these works attacks a hidden assumption everyone stopped questioning.

None of them will 10x your throughput tomorrow. All of them will stack. Combined they will double effective capacity on existing hardware before the end of the year.

KVarN: KV cache quantization that does not trade speed for memory

This is the one everyone will be testing by Monday.

Until this week every KV cache quantization method worked the same way: you compressed the cache to save memory, then you paid the cost of dequantizing every value at every attention step. Everyone accepted this tradeoff. TurboQuant pushed it to the limit, and everyone celebrated until people ran real workloads and found that it slowed inference by 35% at equal batch size and collapsed reasoning quality past 3x compression.

KVarN breaks this tradeoff completely.

It quantizes values per normalised residual norm bucket, not per token or per layer. This means quantization error lands almost entirely in directions that attention will ignore anyway. There is no per-step dequantization penalty. The quantized values are used directly in the attention matmul.

Hard numbers: 3-5x KV cache capacity. 1.4x higher throughput than raw FP16. Less than a 1 point drop on AIME25 at 4x compression. TurboQuant at the same compression level drops 21 points and runs at 42% of the throughput.
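To put the capacity figure in context, here is a back-of-envelope sketch of how many tokens fit in a fixed cache budget at different compression ratios. The model shape (32 layers, 8 KV heads, head dimension 128) and the 16 GiB budget are hypothetical values chosen for illustration, not numbers from the paper, and the arithmetic ignores any quantization metadata such as scales.

```python
def kv_cache_bytes_per_token(num_layers, num_kv_heads, head_dim, bytes_per_value):
    """Bytes of KV cache one token occupies across all layers."""
    # Factor of 2: both a key and a value vector are cached per head per layer.
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value


def max_cached_tokens(budget_bytes, bytes_per_token):
    """How many tokens fit in a fixed cache budget."""
    return int(budget_bytes // bytes_per_token)


# Hypothetical model shape and budget, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM = 32, 8, 128
BUDGET = 16 * 1024**3  # 16 GiB reserved for the cache (assumption)

fp16_per_token = kv_cache_bytes_per_token(LAYERS, KV_HEADS, HEAD_DIM, 2.0)
for ratio in (1, 3, 4, 5):
    tokens = max_cached_tokens(BUDGET, fp16_per_token / ratio)
    print(f"{ratio}x compression: {tokens:,} cached tokens")
```

Under these assumptions an FP16 cache holds 131,072 tokens and a 4x-compressed cache holds 524,288, which is the kind of headroom that turns into larger batches or longer contexts.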

It drops into vLLM with one flag. No calibration. No fine-tuning. Works on every existing decoder model.

This is not an incremental improvement. This replaces FP8 as the default KV cache mode for almost all workloads. It will take two weeks for every inference provider to roll this out.

TaDA: Stop merging LoRAs wrong

Everyone merges LoRAs. Nobody does it correctly.

Every existing merging method treats all adapters the same. They apply a single global weight, average weights across every layer, and hope for the best. This works acceptably for two similar task adapters. It falls apart completely when you try to combine a domain adapter and a task adapter.

The authors found something extremely simple that nobody noticed: domain signals live almost entirely in the upper half of transformer layers. Task signals live almost entirely in the lower half.

You do not need to train anything to exploit this. You do not need any data. You just gate the merge weight per layer: in layer 2, 90% task adapter and 10% domain adapter; in layer 30, 10% task and 90% domain. That gating alone is 90% of the gain.

TaDA adds a small subspace cleanup step to remove conflicting singular directions, and outputs a standard single rank-r LoRA. Zero inference overhead.
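The per-layer gating can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the linear gate schedule and the names `task_gate` and `merge_layer` are hypothetical, the adapters are random stand-ins, and a plain truncated SVD stands in for TaDA's subspace cleanup step, which is not reproduced here.

```python
import numpy as np


def task_gate(layer_idx, num_layers, first=0.9, last=0.1):
    """Task-adapter weight for a layer; the domain adapter gets 1 - gate.

    Assumed linear schedule: `first` at layer 0, `last` at the final layer.
    """
    if num_layers < 2:
        return first
    frac = layer_idx / (num_layers - 1)
    return first + (last - first) * frac


def merge_layer(task_B, task_A, dom_B, dom_A, gate, rank):
    """Gate two LoRA updates for one layer and re-factor to a rank-`rank` LoRA."""
    # Each adapter's full weight update is B @ A.
    delta = gate * (task_B @ task_A) + (1.0 - gate) * (dom_B @ dom_A)
    # Truncated SVD: best rank-`rank` approximation of the blended update.
    # This stands in for TaDA's subspace cleanup, which is not reproduced here.
    U, S, Vt = np.linalg.svd(delta, full_matrices=False)
    new_B = U[:, :rank] * S[:rank]  # (d_out, rank)
    new_A = Vt[:rank]               # (rank, d_in)
    return new_B, new_A


rng = np.random.default_rng(0)
d_out, d_in, r, num_layers = 64, 48, 8, 32  # hypothetical sizes
merged = []
for layer in range(num_layers):
    tB, tA = rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))
    dB, dA = rng.normal(size=(d_out, r)), rng.normal(size=(r, d_in))
    merged.append(merge_layer(tB, tA, dB, dA, task_gate(layer, num_layers), r))
```

Because the output of each layer is again a single pair of rank-r factors, the merged adapter loads like any ordinary LoRA, which is where the zero inference overhead comes from.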

On Llama 2 7B scientific QA, TaDA beats DARE-TIES by an average of 3.6 percentage points across six benchmarks. It wins every single one.

If you are serving more than three LoRA adapters today, you should replace your merging code with this before the end of the week. There is no downside.

Depth-Attention: Cross-layer flow for zero extra cost

Every transformer works the same way: each layer writes to the residual stream, and later layers can only read the final sum. No layer ever gets to go back and select an earlier representation. It just gets the blended average.

Every previous attempt to fix this added extra state to inference. None of them shipped, because any change that increases KV cache size is dead on arrival for production.

Depth-Attention fixes this without adding anything.

Before running self-attention, each layer runs one tiny extra attention operation: it reuses the same query, attends over the values from all earlier layers at this exact token position, mixes them, and writes the result back into the value cache slot. That is it.

No extra parameters. No extra state. No change to KV cache size. 0.01% extra FLOPs.
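A rough sketch of the depth-wise mixing step for a single token and head, under assumptions the description above leaves open: the value vectors double as keys when scoring against the query, the current layer's own value is included in the stack, and the function name `depth_mix` is hypothetical. The actual formulation in the paper may differ.

```python
import numpy as np


def depth_mix(query, layer_values):
    """Mix one token's value vectors across depth.

    query:        (head_dim,) the current layer's query for this token and head
    layer_values: (depth, head_dim) this token's values from layers 0..current
    """
    head_dim = query.shape[-1]
    # Assumption: values double as keys for the depth-wise scores.
    scores = layer_values @ query / np.sqrt(head_dim)
    weights = np.exp(scores - scores.max())  # numerically stable softmax over depth
    weights /= weights.sum()
    return weights @ layer_values            # (head_dim,)


rng = np.random.default_rng(0)
values_so_far = rng.normal(size=(5, 64))  # 5 layers, head_dim 64 (hypothetical)
q = rng.normal(size=64)
mixed = depth_mix(q, values_so_far)
# `mixed` would overwrite this token's slot in the current layer's value cache,
# so the cache keeps its usual shape.
```

The softmax runs over layers rather than over tokens, so its cost scales with model depth and not with sequence length.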

On Qwen3 3B it improves average downstream accuracy by 2.3 points. It beats every existing cross-layer baseline, and runs faster than all of them.

This is a drop-in replacement for the standard decoder block. This will be in every new LLM architecture released from this point forward.

Fast function vectors

Function vectors are the most underused production steering mechanism right now. You extract a single vector from a few demonstration examples, add it to the residual stream, and the model will execute that task reliably for every subsequent generation. No prompt overhead. No fine-tuning.

The original implementation was extremely inefficient. It averaged activations across every attention head. Roughly 90% of those heads contributed nothing, and some of them were actively harmful.

This work uses Layer-wise Relevance Propagation to select only the 5-10% of heads that actually carry the task signal. The resulting function vectors are 10x smaller, 15% more accurate, and can be computed in one forward pass instead of ten.

You can now precompute function vectors for common operations, store them as 2 KB blobs, and inject them at runtime for zero-overhead task steering. This beats prompt engineering for every fixed production task.
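The select-then-inject flow can be sketched as follows. This is an illustration under assumptions, not the paper's code: the relevance scores are taken as given (the Layer-wise Relevance Propagation pass is not implemented), the function vector is formed as the sum of the selected heads' mean outputs, and the names `build_function_vector` and `steer` plus all sizes are hypothetical.

```python
import numpy as np


def build_function_vector(head_outputs, relevance, keep_fraction=0.1):
    """Sum the mean outputs of the most relevant heads.

    head_outputs: (num_examples, num_heads, d_model) per-head contributions
                  to the residual stream, collected on demonstration prompts
    relevance:    (num_heads,) precomputed scores (e.g. from LRP; not computed here)
    """
    num_heads = head_outputs.shape[1]
    k = max(1, int(round(keep_fraction * num_heads)))
    selected = np.sort(np.argsort(relevance)[-k:])
    mean_per_head = head_outputs.mean(axis=0)  # (num_heads, d_model)
    return mean_per_head[selected].sum(axis=0), selected


def steer(hidden_state, function_vector, scale=1.0):
    """Add the function vector to a residual-stream activation."""
    return hidden_state + scale * function_vector


rng = np.random.default_rng(0)
outputs = rng.normal(size=(10, 20, 1024))  # 10 demos, 20 heads (hypothetical)
scores = np.zeros(20)
scores[[3, 7]] = [0.8, 0.5]                # pretend heads 3 and 7 carry the task
fv, heads = build_function_vector(outputs, scores, keep_fraction=0.1)
blob = fv.astype(np.float16).tobytes()     # 1024 fp16 values -> 2048 bytes
```

With a hypothetical residual width of 1024 stored in fp16, the serialized vector is 2048 bytes, which matches the blob size quoted above; wider models would produce proportionally larger blobs.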

SAID: 9x speedup for diffusion LLMs

Diffusion LLMs are not a toy anymore. LLaDA already matches Llama 3 accuracy on most benchmarks and can generate entire paragraphs in parallel. The only remaining problem was that they required 32 denoising steps, so end-to-end latency was still worse than autoregressive models.

SAID fixes this.

It does not run the same number of steps for every token. It first runs full denoising on the 20% of tokens that form the scaffold and define the sentence structure. It then runs 2 steps on all remaining tokens. That is all.
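The uneven step budget is easy to write down. The sketch below assumes the scaffold mask is already known (how SAID picks scaffold tokens is not covered here) and uses hypothetical helper names. It counts token-steps only, which is a crude proxy for compute: wall-clock speedup depends on how the passes are batched, so this count should not be expected to reproduce the measured figures.

```python
def step_schedule(scaffold_mask, full_steps=32, fast_steps=2):
    """Per-token denoising step counts: full for scaffold tokens, few for the rest."""
    return [full_steps if is_scaffold else fast_steps for is_scaffold in scaffold_mask]


def token_step_ratio(schedule, full_steps=32):
    """Uniform-schedule token-steps divided by this schedule's token-steps."""
    return (full_steps * len(schedule)) / sum(schedule)


# Hypothetical 100-token output where every fifth token is scaffold (20%).
mask = [i % 5 == 0 for i in range(100)]
schedule = step_schedule(mask)
print(sum(schedule), token_step_ratio(schedule))
```

For this made-up mask the schedule spends 800 token-steps where a uniform 32-step schedule would spend 3,200.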

Maximum measured speedup is 9.1x over standard iterative decoding. Quality loss is within measurement error across math, coding and knowledge benchmarks.

This brings diffusion LLM latency well below autoregressive latency for long-form generation. This is the turning point.

STaR-Quant: Quantizing diffusion LLMs correctly

Regular quantization does not work on diffusion LLMs. Everyone found this out the hard way over the last two months.

There are two failure modes. First, masked and unmasked tokens have completely different activation distributions within every step. Static quantization scales are wrong for half the tokens every single pass. Second, quantization error accumulates across every denoising step. A 1% error on step 1 becomes a 30% error on step 24.

STaR-Quant fixes both. It splits activation transforms for masked and unmasked tokens, and adds a tiny per-block compensation term that cancels accumulated error across steps.
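The first failure mode can be illustrated with a toy example. The sketch below is a simplification and not STaR-Quant itself: it uses separate int8 scales per token group in place of the paper's split activation transforms, it omits the per-block compensation term entirely, and the activation data is made up so that the two groups differ sharply in magnitude.

```python
import numpy as np


def quantize_int8(x, scale):
    """Symmetric int8 quantization followed by dequantization."""
    q = np.clip(np.round(x / scale), -127, 127)
    return q * scale


def shared_scale_roundtrip(acts):
    """One static scale for every token."""
    return quantize_int8(acts, np.abs(acts).max() / 127.0)


def split_scale_roundtrip(acts, is_masked):
    """One scale for masked tokens and another for unmasked tokens."""
    out = np.empty_like(acts)
    for group in (is_masked, ~is_masked):
        if group.any():
            scale = np.abs(acts[group]).max() / 127.0
            out[group] = quantize_int8(acts[group], scale)
    return out


rng = np.random.default_rng(0)
is_masked = np.arange(16) % 2 == 0
acts = rng.normal(size=(16, 32))
acts[is_masked] *= 0.01   # masked tokens: small activations (made-up data)
acts[~is_masked] *= 10.0  # unmasked tokens: large activations

err_shared = np.abs(shared_scale_roundtrip(acts) - acts)[is_masked].mean()
err_split = np.abs(split_scale_roundtrip(acts, is_masked) - acts)[is_masked].mean()
print(f"masked-token error: shared={err_shared:.5f} split={err_split:.5f}")
```

With a shared scale, the large unmasked activations set the step size and the small masked activations mostly round to zero; giving each group its own scale removes that mismatch.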

It delivers 3.14x memory reduction and 1.69x speedup over FP16, with less than 0.5% accuracy drop. This makes 8B diffusion LLMs run well on 16 GB consumer cards.

What stacks, what does not

None of these optimizations conflict. All of them compose.

You can run a model with Depth-Attention, merged with TaDA LoRAs, steered with function vectors, with its KV cache quantized by KVarN. The gains stack. On paper, that combination delivers ~4x higher effective throughput on identical hardware compared to the standard stack that everyone was running 30 days ago.

Not everything works for every workload.

KVarN is a no-brainer for everyone. Test it first. TaDA is only relevant if you merge LoRAs; if you do, the gains are free. Depth-Attention requires modifying the model graph, so you will only get it when base model authors adopt it. Function vectors work for fixed-task workloads and do nothing for open chat. The diffusion LLM optimizations only matter if you are already testing LLaDA. Everyone else can wait three months.

What comes next

We are no longer waiting for bigger models. We are now in the phase where every month we get 20-30% more performance out of the exact same weights we already have.

This will continue for at least another two years. There are still dozens of unexamined assumptions baked into every standard transformer implementation. Every one of them is a 10% gain waiting to be found.

Most of these papers will not get press releases. None of them will have Twitter threads with 100k likes. All of them will be running on every LLM endpoint you use before the end of the summer.