All public discussion this month was about 2-bit quantization benchmarks. None of it mattered.
The actual progress happened in three quiet drops that almost nobody shared. This is not an incremental improvement. By the end of the year you will run a 35B MoE at usable speed on a consumer laptop.
Stop chasing the next quantization trick
For three years we have treated quantization as a rounding problem. Researchers have competed to produce ever more elaborate calibration routines, per-channel scale factors, and clipping thresholds. All of them are workarounds. None address the fundamental limit.
At sub-4-bit precision, quantization fails geometrically before it fails numerically.
Power-of-two quantization is the only approach that actually makes sense for edge hardware. It replaces every multiply operation with a single bit shift. It removes the single largest timing, power and area cost from every inference engine.
Everyone abandoned it because it produced garbage accuracy below 4 bits. Nobody stopped to ask why.
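To make the shift claim concrete, here is a minimal sketch of power-of-two quantization on fixed-point activations. The function names, the exponent range and the 8 fractional bits are assumptions picked for illustration, not taken from any particular engine.

```python
import math

def quantize_pow2(w, min_exp=-3, max_exp=0):
    """Round a weight to sign * 2**e, with e clamped to [min_exp, max_exp].

    Returns (sign, e). The exponent range is a hypothetical choice.
    """
    if w == 0:
        return 0, min_exp
    sign = 1 if w > 0 else -1
    e = round(math.log2(abs(w)))          # nearest power of two in the log domain
    e = max(min_exp, min(max_exp, e))
    return sign, e

def shift_mul(x_fixed, sign, e):
    """Multiply a fixed-point integer by sign * 2**e using only a shift and a negate."""
    if sign == 0:
        return 0
    shifted = x_fixed << e if e >= 0 else x_fixed >> -e
    return shifted if sign > 0 else -shifted

def dot_shift_add(xs_fixed, codes):
    """Dot product built from shifts and adds, with no multiply anywhere."""
    return sum(shift_mul(x, s, e) for x, (s, e) in zip(xs_fixed, codes))

FRAC_BITS = 8                              # assumed fixed-point format
weights = [0.5, -0.25, 1.0, 0.13]
codes = [quantize_pow2(w) for w in weights]
activations = [1.0, 2.0, 0.5, 0.25]
xs_fixed = [int(a * (1 << FRAC_BITS)) for a in activations]
result = dot_shift_add(xs_fixed, codes) / (1 << FRAC_BITS)
print(result)  # 0.53125
```

Note what happened to the last weight: 0.13 became 0.125, the nearest point on the grid. Right shifts also drop low-order bits, so very small activations lose precision in this toy format.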
OrpQuant kills multipliers, finally
OrpQuant is the first quantization method that actually fixes the underlying problem.
At low bit depth, the exponential lattice used for power-of-two quantization has extremely poor angular resolution. When you project a 4096-dimensional residual vector onto this grid you do not just add uniform noise. You shear and fold the entire feature manifold. Semantic structure is destroyed before you even get to rounding error.
OrpQuant does not tune scale factors. It adds a second orthogonal residual projection lattice. This lattice is also implemented entirely with shift and add operations. No multipliers. No asymmetric scaling. No lookup tables.
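The angular argument can be checked on a toy vector. The sketch below is not OrpQuant's solver or its lattice construction, neither of which is spelled out here. It is a generic two-term power-of-two decomposition, where each weight becomes one shift plus a second shift for the residual, and it is only meant to show how much direction a single power-of-two grid loses. All names and parameters are assumptions.

```python
import math
import random

def pow2_round(w, min_exp=-10, max_exp=0):
    """Nearest signed power of two (log-domain rounding), or 0.0 for tiny values."""
    if abs(w) < 2.0 ** (min_exp - 1):
        return 0.0
    e = round(math.log2(abs(w)))
    e = max(min_exp, min(max_exp, e))
    return math.copysign(2.0 ** e, w)

def cosine(a, b):
    """Cosine similarity: 1.0 means the direction is fully preserved."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb)

def l2_error(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

rng = random.Random(0)
w = [rng.gauss(0.0, 0.1) for _ in range(4096)]   # synthetic stand-in for a weight row

# One power-of-two term per weight: a single shift in hardware.
one_term = [pow2_round(x) for x in w]
# Add a second power-of-two term fitted to the residual: shift, shift, add.
two_term = [q + pow2_round(x - q) for x, q in zip(w, one_term)]

print("one term:", cosine(w, one_term), l2_error(w, one_term))
print("two term:", cosine(w, two_term), l2_error(w, two_term))
```

The second term costs one more shift and one add per weight, and the quantized vector points much closer to the original direction.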
The results are not marginal. For LLaMA-2 7B with 3-bit weights, OrpQuant hits 6.10 perplexity. That is within 1% of AWQ. AWQ still uses full-width integer multipliers for every operation. OrpQuant has none.
Full model calibration takes 15 minutes. No gradient descent. No training data. Just an analytical solver.
The authors also synthesized this design on 28 nm standard cells. The matrix multiply unit fits in one third of the area and closes timing at twice the clock speed of a standard quantized MAC array. This is not just a better software algorithm. This is what will be taped out into every edge NPU in 2027.
We have been quantizing the wrong architecture
This is the bigger one. The residual-free transformer paper is the most important LLM architecture paper in the last 12 months.
Every single quantization hack invented since 2023 exists to work around one self-inflicted flaw: residual connections produce heavy-tailed activations.
Every time you add to the residual stream you are performing an unconstrained random walk in activation space. Kurtosis explodes. Distributions stop being Gaussian. Outliers appear at every layer. It does not matter how good your quantizer is. You cannot compress this distribution cleanly. SmoothQuant, AWQ and GPTQ are all just elaborate bandages for this architectural mistake.
Residual-free transformers do not have this problem. Activation kurtosis stays flat through 80 layers. You can quantize them to 3 bits with literal uniform quantization and lose less than 0.5% accuracy.
There is one tradeoff. Full precision performance is down approximately 2%. That is the entire tradeoff. 2% worse at FP16, 30% better at 3-bit. Nobody building edge deployments will hold out for the 2%. Every foundation model released 12 months from now will be residual-free.
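Why kurtosis matters for a plain uniform quantizer is easy to see on synthetic data. The sketch below does not simulate a transformer or reproduce the paper's measurements. It takes Gaussian samples, injects two large outliers as a stand-in for outlier activations, and compares 3-bit absmax quantization error on both. The outlier value and sample count are arbitrary assumptions.

```python
import math
import random

def excess_kurtosis(xs):
    """Fourth standardized moment minus 3; roughly 0 for Gaussian data."""
    n = len(xs)
    mean = sum(xs) / n
    m2 = sum((x - mean) ** 2 for x in xs) / n
    m4 = sum((x - mean) ** 4 for x in xs) / n
    return m4 / (m2 * m2) - 3.0

def uniform_quantize(xs, bits=3):
    """Symmetric uniform quantization with a single absmax scale."""
    qmax = 2 ** (bits - 1) - 1             # 3 levels on each side of zero at 3 bits
    scale = max(abs(x) for x in xs) / qmax
    return [max(-qmax, min(qmax, round(x / scale))) * scale for x in xs]

def relative_error(xs, qs):
    """Quantization error norm divided by signal norm."""
    num = sum((x - q) ** 2 for x, q in zip(xs, qs))
    den = sum(x * x for x in xs)
    return math.sqrt(num / den)

rng = random.Random(0)
gaussian = [rng.gauss(0.0, 1.0) for _ in range(8192)]

# Same samples with two large outliers injected (hypothetical magnitude).
heavy = list(gaussian)
for i in range(0, len(heavy), 4096):
    heavy[i] = 40.0

for name, xs in (("gaussian", gaussian), ("heavy-tailed", heavy)):
    err = relative_error(xs, uniform_quantize(xs))
    print(name, "kurtosis:", excess_kurtosis(xs), "3-bit error:", err)
```

The outliers set the scale, so the step size becomes wider than the whole bulk of the distribution and almost every ordinary value rounds to zero. Flat kurtosis is what lets a naive quantizer keep its levels where the data actually is.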
What actually works right now on 16GB
None of this is just future research. You can run production-grade LLMs on 16GB RAM today. Most people just do not test correctly.
The recent Gemma 4 test is the first real-world benchmark I have seen all year. Nobody runs MMLU for work. You run workflows that will fail silently if even one token is wrong.
OpenUI does not give partial credit. It will not overlook a wrong parameter name. It will not grade you on intent. It either renders, or it does not.
Gemma 4 2B passed this test. On a 16GB laptop. No cloud. No API key. It rendered correct, working UI layouts on the first try. That is not a toy. That is something you can build an actual product on.
Everyone arguing about 70B benchmark scores missed this. For 90% of structured tasks people actually deploy, a good, well-quantized 2B model will beat a badly deployed 70B model every single time.
The 30% speedup that llama.cpp rejected
This is the best joke in local LLMs right now. There is a 12-line change to llama.cpp that gives up to 30% faster prompt processing for all MoE models. It was rejected.
PR #21344 does nothing clever. It just reorders the expert dispatch loop to avoid redundant memory loads. That is it. No tricks. No accuracy loss. No change to output.
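The patch itself is not reproduced here. As a toy illustration of why loop order alone can matter for MoE dispatch, the sketch below counts how often expert weights have to be fetched under two orderings. It assumes, hypothetically, that only one expert's weights fit in fast memory at a time. The routing table and function name are made up for the example and say nothing about the actual llama.cpp code.

```python
def count_weight_loads(order):
    """Count expert-weight fetches, assuming only one expert stays resident at a time."""
    loads, resident = 0, None
    for expert in order:
        if expert != resident:
            loads += 1
            resident = expert
    return loads

# Hypothetical routing: each of six tokens is sent to two of four experts.
routing = [(0, 2), (1, 2), (0, 3), (1, 2), (0, 2), (1, 3)]

# Token-major order: finish each token before moving to the next.
token_major = [e for experts in routing for e in experts]

# Expert-major order: do all the work for one expert before touching the next.
expert_major = sorted(token_major)

print(count_weight_loads(token_major))   # 12
print(count_weight_loads(expert_major))  # 4
```

Both orders visit the same twelve token and expert pairs. Only the memory traffic differs.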
Numbers on Strix Halo: 1106 t/s baseline, 1437 t/s with the patch at 512 context. That is not 5%. That is 30%. At 60k context it is still 12% faster.
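The 512-context figure is easy to verify from the two throughput numbers quoted above. The helper name is just for this example.

```python
def speedup_percent(baseline_tps, patched_tps):
    """Relative throughput gain of the patched build over baseline, in percent."""
    return (patched_tps / baseline_tps - 1.0) * 100.0

gain = speedup_percent(1106, 1437)
print(f"{gain:.1f}%")  # 29.9%
```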
It will never be merged. Maintainers said it breaks code cleanliness.
This is the state of edge LLM deployment right now. The biggest performance gains are not in fancy papers. They are 12-line patches sitting in rejected PRs that everyone copies and applies locally.
This is not the future you were promised
You were told edge LLMs would arrive when someone invented a magic 1-bit quantization algorithm. That was a lie.
Edge LLMs are arriving right now, from three boring, unglamorous directions:
- Someone finally fixed the geometry of low-bit quantization instead of tuning scale factors
- Someone noticed we had built an architecture that was actively hostile to compression
- Some random guy on GitHub fixed a loop and got rejected for it
None of these got press releases. None have Twitter threads with 10k retweets. All of them work.
By the end of this year you will run a 35B MoE at 40 tokens per second on a laptop. You will not need a new GPU. You will just be running code that nobody upstream wanted to merge.