Skip to content

The State Of Local LLM Deployment, June 2026

#local-llm #inference-optimization #llm-operations #kv-cache #ai-gateway

Stop building AI infrastructure from scratch

If you are running self hosted LLMs today, you are almost certainly reinventing problems that already have good solutions.

Over the last four weeks the community shipped almost every missing piece required to run local or self hosted LLMs at production quality. This is not paper research. All of this works right now, you can deploy it this week.

This article walks through what shipped, what works, what doesn't, and what you should be using today.

AI gateways are no longer optional

Six months ago AI gateways were a curious demo. Today they are the first layer you should deploy before any LLM backend.

The pattern is simple. You place a proxy between every client and every LLM backend. Every request goes through it. This is not a new idea. This is exactly what API gateways have done for regular services for 15 years.

Nobody builds a production API without a gateway. Nobody will be building production AI infrastructure without one by the end of this year.

You get in one place:

  • Uniform OpenAI compatible API for every backend and provider
  • Automatic failover and load balancing
  • Per user, per model budget enforcement
  • Full observability and logging
  • Provider lock in elimination
  • Semantic caching

The best implementation available today is Bifrost. It adds 11 microseconds of overhead at 5000 requests per second. That is effectively zero. You will not measure the latency penalty.

One extremely underrated use case is client decoupling. You can run unmodified Claude Code client against Devstral, Mistral, Llama 3 or any local model. You just rewrite the request and response on the fly. No patches to the client required.

There are still rough edges. At time of writing Bifrost does not correctly translate Anthropic reasoning effort parameters between providers. You can work around this with CLAUDE_CODE_DISABLE_THINKING=1 for now. The bug is open and will almost certainly be fixed within a month.

LiteLLM and OpenRouter are valid alternatives. Bifrost is currently the best option for self hosted deployments.

You can run production workloads on Spot VMs

You can cut your GPU compute costs by 90%. All you have to do is stop treating your compute nodes as permanent.

Spot VMs on GKE, EKS or any major cloud are 70-90% cheaper than on demand instances. They can be evicted at any time. Most engineers try them once, get their job killed after 37 hours, and never touch them again.

This is not an infrastructure problem. This is an application design problem.

Building interrupt resilient workloads only requires four things:

  1. Catch the SIGTERM signal. You get 15 seconds warning before the node is killed. That is more than enough time to flush state and exit cleanly.
  2. Write checkpoints to external object storage every 5-15 minutes. Never store anything important on the node local filesystem.
  3. Make every operation idempotent. Running the same job twice must produce exactly the same result.
  4. Pull work from an external queue. Never hardcode work lists inside your job container.

Do this correctly and your workloads will survive evictions transparently. No user will ever notice. You will never go back to on demand GPUs.

This is not theory. Every major LLM training run today runs on 100% spot capacity. There is no reason your inference and fine tuning jobs cannot do the same.

KVarN rewrites the KV cache tradeoff

For two years every KV cache optimization has followed the same pattern: you trade a small amount of model quality for a large reduction in memory usage. And usually you also get a speed penalty.

KVarN breaks this pattern.

This is a new quantization scheme from Huawei, recently ported to llama.cpp. It delivers Q5 quality at 4 bit size. It delivers Q4 quality at 3.5 bit size.

Looking at the benchmark numbers for Qwen 3.6 27B at 64k context: At identical 28% of original bf16 cache size:

  • Standard q4_0 has 99.57% mean precision
  • KVarN 4 bit has 99.74% mean precision

That is almost Q6 quality at Q4 size.

At the tail 99.9th percentile the gap is even larger. KVarN avoids the catastrophic quality collapse that plagues every other KV cache quantization scheme at long context. For agentic work this difference is not marginal. It is the difference between a model that reliably completes tasks and one that forgets half the instructions halfway through.

This is the single largest improvement to local inference performance released in the last 12 months. You should enable it on every deployment you run today.

The current llama.cpp implementation is not yet optimized for speed. It runs about 10% slower than unquantized cache. The reference vLLM implementation runs faster than baseline. This gap will close over the next month.

Quantization aware training has arrived

Post training quantization was always a compromise. You take a model trained for fp16, then you chop bits off and hope it still works.

Google changed this with Gemma 4. They trained the model end to end with quantization baked in. The model learns to operate correctly at 4 bit precision during training.

The quality difference is not subtle. A QAT 4 bit Gemma 4 12B outperforms a post trained quantized 8 bit model. It fits entirely in 9GB of VRAM. It runs at 50 tokens per second on a mid range GPU.

This is the new standard. Every model released 12 months from now will be trained with quantization awareness. Post training quantization will be considered a legacy technique for old models.

Unsloth has already published MTP GGUF weights for all Gemma 4 sizes. These are the best general purpose local models available today.

Hardware builds for local inference

If you are building a dedicated local inference server today, there is one correct configuration.

4x RTX 3090. 96GB total VRAM.

That is it.

Nothing else comes even close to the price performance. Used 3090s trade for $600-$700 each. For $2800 you get 96GB of ECC VRAM that will run every model up to 70B 4 bit at usable speed.

The optimal supporting hardware is an AMD EPYC 9575F, 768GB DDR5 ECC RAM. This combination will run vLLM and llama.cpp at maximum throughput.

You do not need 4090s. You do not need H100s. You do not need anything released after 2021. The local LLM community has effectively standardized on the 3090 as the baseline inference hardware.

This will remain true for at least another 18 months.

oMLX fixes local inference on Apple Silicon

For two years running local LLMs on Apple Silicon was a constant battle between broken runtimes, bad memory management and terrible cache behaviour.

oMLX fixes this.

It is a menu bar app for macOS that runs a fully featured LLM server in the background. It implements tiered KV caching that persists across requests, across server restarts, and offloads unused cache blocks to SSD.

You can pin frequently used models in memory. Unused models are automatically evicted. It exposes a standard OpenAI compatible API. Every client works with it unmodified.

This is the first local LLM runtime that just works. You install it, download a model, and forget about it. It stays running in the background. It will not crash. It will not leak memory.

If you run local models on a Mac you should uninstall LM Studio today and install oMLX instead.

BitNet makes 100B models run on CPU

BitNet is no longer a research paper.

Microsoft released bitnet.cpp last month. It runs 1.58 bit ternary models on standard CPUs at usable speed.

A 100B BitNet model runs at 5-7 tokens per second on a single modern x86 CPU. No GPU required.

This is not fast. But it is fast enough for many use cases. And it means you can run a state of the art large model on any server, any laptop, almost any device.

The bottleneck now is model availability. Very few models are trained natively for 1.58 bit today. This will change very quickly.

vLLM Omni standardizes multimodal serving

vLLM Omni is the new upstream runtime for multimodal inference.

It extends the existing vLLM stack with native support for images, video, audio and diffusion models. It uses the same paged KV cache system, the same scheduler, and the same API.

You can run text, vision, audio and TTS models on the same server instance. All use the same operational tooling, the same logging, the same scaling logic.

This replaces half a dozen separate specialized runtimes that everyone was stitching together. It will become the default multimodal serving runtime by the end of the year.

What still doesn't work

None of this is perfect.

Bifrost still has broken parameter translation between provider APIs. KVarN speed optimizations have not landed in mainline llama.cpp. BitNet has almost no public production ready models. oMLX only runs on macOS. Spot VM graceful shutdown still does not work correctly on EKS.

All of these are temporary problems. All will be fixed. None of them should stop you from deploying this infrastructure today.

Closing

We have crossed an invisible line.

Six months ago running self hosted LLMs at production quality required a team of dedicated engineers. Today one engineer can deploy a full stack over a long weekend. It will be cheaper, faster, more reliable and more private than any commercial API.

The infrastructure is ready. The models are ready. The only thing missing is people actually building with it.