This is not an announcement of a new model. This is not a paper about a new architecture that will change everything in two years. This is about what happened last week.
Right now, you can run a 35 billion parameter mixture of experts model on a 16GB consumer GPU, at 100 tokens per second. Six months ago this required a 24GB card. Twelve months ago it required 48GB.
Nobody made a big announcement. There was no press release. Four papers dropped on arXiv the same day. Three people posted benchmarks on Reddit. And the entire practical frontier for LLM deployment moved 70% forward.
We stopped chasing magic architectures
For three years almost all optimization work followed the same pattern. Someone would propose a new transformer variant. Everyone would run benchmarks for six months. Everyone would conclude it was 5% faster but broke fine tuning or long context, and go back to vanilla transformers.
That ended this month.
Nobody is proposing new attention mechanisms anymore. Nobody is arguing about RoPE variants. Every single improvement released in the last 30 days operates on layers everyone already agreed were good enough. They just fixed the parts that everyone had stopped looking at.
Quantization is no longer about rounding better
For two years every quantization paper did the exact same thing. They would come up with a slightly smarter way to round weights. They would show a 0.1 point improvement in perplexity at 4 bit. Everyone would clap and then go back to using AWQ.
This was always a dead end. The hard problem was never weights. It was activations.
Activation outliers break every low bit quantization scheme. Everyone knew this. Everyone also accepted that you could not go below 8 bit for activations without destroying model quality. That assumption died on June 12th.
OffQ: The outlier problem is finally solved properly
OffQ is not another rounding trick. It does not try to make outliers fit into the quantization grid. It moves them out entirely.
The method is brutally simple. Run top-1 PCA on activation distributions. Rotate the activation space so all outlier magnitude lands on exactly one channel. Then absorb that entire channel into a single shared offset. The remaining activations have standard deviation reduced by 62% on average.
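Here is a minimal NumPy sketch of that rotate-then-offset idea, not the paper's implementation. The function names, the Householder reflection used as the rotation, and the choice of the channel mean as the shared offset are all assumptions made for illustration.

```python
import numpy as np

def householder_to_first_axis(v):
    """Orthogonal matrix H such that H @ v points along channel 0."""
    v = v / np.linalg.norm(v)
    e0 = np.zeros_like(v)
    e0[0] = 1.0
    u = v - e0
    norm_u = np.linalg.norm(u)
    if norm_u < 1e-12:                  # v already points along channel 0
        return np.eye(len(v))
    u /= norm_u
    return np.eye(len(v)) - 2.0 * np.outer(u, u)

def isolate_outlier_direction(acts):
    """acts: (tokens, channels). Returns (residual, rotation, offset)."""
    # Top-1 principal direction of the uncentered activations, so a large
    # constant outlier component is captured along with its variance.
    _, _, vt = np.linalg.svd(acts, full_matrices=False)
    H = householder_to_first_axis(vt[0])
    rotated = acts @ H.T                # outlier energy now sits in channel 0
    offset = np.zeros(acts.shape[1])
    offset[0] = rotated[:, 0].mean()    # assumed: one shared offset for that channel
    return rotated - offset, H, offset
```

The residual is what you would hand to a plain uniform quantizer. Because H is orthogonal, `(residual + offset) @ H` recovers the original activations, so in a real model the rotation and offset could in principle be folded into the neighboring weights and biases.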
This works. Across Llama 3, Qwen 3 and Gemma 4, OffQ delivers W4A4KV4 uniform quantization with less than 1% perplexity degradation. No grouped quantization. No per token scaling. Just plain uniform 4 bit everywhere.
This is not an incremental improvement. This removes the single largest remaining barrier to full 4 bit inference. Every inference engine will implement this before the end of the quarter.
SigmaScale: Low rank compression stopped being a meme
Everyone has known for five years that transformer weight matrices are extremely low rank. Everyone has also known that every attempt to actually compress them with SVD destroyed model quality.
SigmaScale fixes this with one trivial change that nobody tried for three years. Instead of running SVD on the raw weight matrix, first learn two diagonal scaling matrices. Optimize them directly against activation error before truncation.
That is the entire trick.
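To make the shape of the trick concrete, here is a rough NumPy sketch. It is not SigmaScale itself: the paper learns both diagonal scalings, while this sketch swaps in a closed-form input scaling (per-channel activation RMS) and leaves the output scaling as an identity placeholder. All function names are hypothetical.

```python
import numpy as np

def truncated_svd(W, rank):
    """Best rank-`rank` approximation of W in Frobenius norm."""
    U, s, Vt = np.linalg.svd(W, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def scaled_low_rank(W, X, rank, eps=1e-8):
    """W: (out, in) weights. X: (tokens, in) calibration activations."""
    # Assumption: closed-form scalings stand in for the learned ones.
    d_in = np.sqrt((X ** 2).mean(axis=0)) + eps   # per-input-channel RMS
    d_out = np.ones(W.shape[0])                   # identity placeholder
    W_scaled = (d_out[:, None] * W) * d_in[None, :]
    W_low = truncated_svd(W_scaled, rank)         # truncate in the scaled space
    return (W_low / d_in[None, :]) / d_out[:, None]

def activation_error(W, W_hat, X):
    """Error measured on layer outputs rather than on raw weights."""
    return np.linalg.norm(X @ (W - W_hat).T)
```

Comparing `activation_error` for `scaled_low_rank(W, X, r)` against `truncated_svd(W, r)` on calibration data with uneven channel magnitudes shows why scaling first matters: plain SVD spends rank on weight directions the activations barely touch.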
This reduces effective intrinsic rank by 38% on average for Llama 3.1 layers. You can remove 50% of the parameters from feed forward networks and lose less than 0.3 perplexity. No fine tuning required. No LoRA. Just run this once on a trained model.
It is not perfect. It breaks on attention output projections. But for 70% of the parameters in every modern LLM, this works right now.
MoE sparsity went one level deeper
Everyone talks about MoE sparsity at the expert level. You activate 2 out of 8 experts per token. Nobody asked what was inside those experts.
The SGATLin paper asked exactly that. They found you can replace every single feed forward expert with individual linear neurons. No nonlinearity. No dense layers. Just gate 8 out of 256 single neurons per token.
At identical FLOP budget, this delivers better perplexity than standard MoE layers built from dense feed forward experts.
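A toy version of such a layer might look like the following. The routing details are assumptions: the source only says 8 of 256 single linear neurons are gated per token, so the separate gate matrix, the softmax over the selected scores, and every name here are hypothetical.

```python
import numpy as np

def gated_linear_neurons(x, W_gate, W_in, W_out, k=8):
    """x: (d,). W_gate, W_in, W_out: (n, d). Returns (output, chosen indices)."""
    scores = W_gate @ x                         # one routing score per neuron
    top = np.argpartition(scores, -k)[-k:]      # indices of the k highest scores
    weights = np.exp(scores[top] - scores[top].max())
    weights /= weights.sum()                    # softmax over the selected neurons
    pre = W_in[top] @ x                         # each neuron is a single dot product
    y = (weights * pre) @ W_out[top]            # the gate is the only nonlinearity
    return y, top
```

Each selected neuron contributes one scalar times one output vector, which is what makes inspecting or deleting an individual neuron a well-defined operation in this sketch.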
This is not a small result. This means we have been running 10x more compute than required inside every MoE FFN for the last three years. It also means every activated neuron does exactly one semantic thing. You can read them. You can edit them. You can delete the ones that hallucinate.
We were wasting 30% of decode time copying KV cache entries
On June 16th, ggerganov merged a 12 line patch into llama.cpp.
That patch removed a redundant copy of KV cache cells that ran once per token. On Gemma 4 12B, decode throughput went from 104 tok/s to 149 tok/s. That is a 43% speedup. No changes to the model. No changes to quantization. Just stopped copying memory for no reason.
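It is worth separating throughput gain from time saved, because they are not the same number. A quick calculation from the two throughput figures above (the variable names are just for illustration):

```python
before_tps = 104.0   # decode throughput before the patch, tokens/s
after_tps = 149.0    # decode throughput after the patch, tokens/s

speedup = after_tps / before_tps - 1.0       # relative throughput gain
ms_before = 1000.0 / before_tps              # time per token before, ms
ms_after = 1000.0 / after_tps                # time per token after, ms
time_saved = 1.0 - ms_after / ms_before      # share of old decode time removed

print(f"throughput gain: {speedup:.1%}")
print(f"per token: {ms_before:.2f} ms -> {ms_after:.2f} ms")
print(f"decode time removed: {time_saved:.1%}")
```

A 43% throughput gain corresponds to roughly 30% of the old per-token time, so the copy was eating close to a third of every decode step.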
Nobody noticed this for three years.
This is the state of our industry. We had thousands of papers written about attention efficiency. We had hundreds of people arguing about theoretical space bounds for streaming attention. And the single largest performance bottleneck in every production deployment was a stray memcpy that nobody bothered to profile.
The streaming attention bounds paper released the same day is good work. It proves we can compress KV cache by another 4x with bounded error. But it will not matter for most users until we stop wasting nearly a third of our decode time copying memory.
Luce Spark: Expert offloading finally does not suck
Everyone knew you could offload cold MoE experts to system RAM. Everyone also knew it destroyed throughput. Naive offload would run at 55% of native speed at best. Everyone accepted this as a fundamental tradeoff.
Luce Spark fixes this.
It does three very boring things. First it counts which experts actually get used on your traffic. It keeps the top 8% on GPU. Second it runs an async ring cache for cold expert loads, overlapped with attention compute. Third it fuses the entire decode pass into one graph instead of 40 separate kernel submissions.
That is all.
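The first two steps can be sketched in a few lines of plain Python. This is not Luce Spark's code: the function and class names are hypothetical, the cache here is synchronous, and the overlap with attention compute and the fused decode graph are left out entirely.

```python
from collections import Counter
import math

def pick_resident_experts(routing_trace, total_experts, gpu_fraction=0.08):
    """routing_trace: iterable of (layer, expert) pairs observed on real traffic."""
    counts = Counter(routing_trace)
    budget = max(1, math.ceil(total_experts * gpu_fraction))
    # The most frequently routed experts stay resident on the GPU.
    return {key for key, _ in counts.most_common(budget)}

class RingCache:
    """Fixed number of GPU slots for cold experts, overwritten in ring order."""

    def __init__(self, slots):
        self.keys = [None] * slots
        self.next_slot = 0
        self.loads = 0

    def fetch(self, key):
        if key in self.keys:
            return self.keys.index(key)   # already staged, no transfer needed
        slot = self.next_slot
        self.keys[slot] = key             # a real system would start an async copy here
        self.next_slot = (slot + 1) % len(self.keys)
        self.loads += 1
        return slot
```

In use, a decode step would check `pick_resident_experts` output first and only fall back to `RingCache.fetch` for the cold experts, so `loads` gives a rough count of host-to-GPU transfers on your traffic.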
The result: 35B Qwen 3.6 MoE runs at 100 tok/s on 16GB VRAM. That is 85% of the speed you get running the entire model on a 24GB card. No quality loss. No approximation. Bit identical output.
This is not research. This is production code you can run today.
QAT changed the quantization baseline
Google dropped QAT quantized Gemma 4 models two weeks ago. Nobody has properly digested what this means.
The official Google Q4_0 quantized model outperforms every third party Q4_K quant. It even outperforms most Q5 quantizations. At the same time it is 20% smaller.
This is not a trick. This is what happens when you quantize during training instead of after. All the previous quantization work was trying to fix damage caused by quantizing a model that was never trained to tolerate it.
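For readers who have not seen it, the core of quantization aware training fits in a short sketch: the forward pass sees weights rounded to the low bit grid, and the gradient is applied to the full precision copy (the straight-through estimator). The grid below is a simplified symmetric 4 bit scheme with one scale per block of 32, not the exact Q4_0 layout, and the toy linear model and function names are assumptions for illustration.

```python
import numpy as np

def fake_quant_4bit(w, block=32):
    """Round-trip weights through a symmetric 4-bit grid, one scale per block."""
    flat = w.reshape(-1, block)
    scale = np.abs(flat).max(axis=1, keepdims=True) / 7.0
    scale = np.where(scale == 0.0, 1.0, scale)    # all-zero block: avoid dividing by zero
    q = np.clip(np.round(flat / scale), -8, 7)    # 16 integer levels
    return (q * scale).reshape(w.shape)

def qat_step(w, x, y, lr=0.05):
    """One gradient step on a toy linear model whose forward pass uses quantized weights."""
    w_q = fake_quant_4bit(w)
    err = x @ w_q - y
    grad = x.T @ err / len(x)     # gradient with respect to the quantized weights
    return w - lr * grad          # straight-through: update the full-precision copy
```

Post training quantization only ever runs `fake_quant_4bit` once, at the end. Running it inside every training step is what lets the weights settle somewhere the grid can represent well.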
We are not going back. Every new model released from this point will ship with native QAT quantizations. Post training quantization will be obsolete within 12 months.
What this means for deployment right now
If you are running LLM inference today, you can go and implement all of this before the end of next month.
You will get 2.2x lower memory usage. You will get 1.8x higher throughput. You will see no measurable quality loss.
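Almost none of this requires retraining your model. None of this requires changing your API. The exceptions are QAT, which depends on the vendor shipping QAT checkpoints, and SGATLin, which changes what an expert is. Everything else works on existing transformer models released in the last two years, with expert offloading applying to the MoE ones.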
The quiet end of the big GPU moat
Six months ago the common wisdom was that you needed an H100 to run production grade inference at scale. Three months ago it was an RTX 4090. Today it is a 16GB consumer card that costs $350.
This shift did not happen because someone built a better GPU. It happened because we finally stopped accepting all the stupid waste that had accumulated in the stack over the last three years.
There will be more gains. There are obvious stupid things we are still doing. But this was the big one. The moat is gone. Anyone can run good LLMs at scale now.