Local LLMs stopped being a hobby: The 2026 production playbook

#local-llm #llama.cpp #quantization #ml-infrastructure #hardware

Hardware is no longer a one-horse race

For three years the question "what do I run local LLMs on" had one correct answer: NVIDIA. That is no longer true.

This is not an opinion. This is what people running production workloads are reporting right now. On a Radeon 9070 XT, Gemma4 26B A4B runs 30-40% faster than an equivalently quantized Qwen 35B runs on a comparable RTX 5070 Ti. No one is making this up. You can go verify the benchmarks right now. AMD has caught up on llama.cpp kernel support. You will still have a bad time if you want to pre-train large models from scratch. For inference, which is what 95% of teams actually do, AMD is a valid, often better choice right now.

Even more importantly: stop buying flagship cards. The sweet spot for local deployment right now is 4x 5060 Ti 16GB. No one tells you this. Everyone will still tell you you need one big expensive card. For 90% of workloads four cheap 16GB cards stacked will give you higher total throughput, higher effective context, and lower total cost than one single big card.

If you undervolt, you can run four 5060 Tis one slot apart. You do not need 2 slot gaps. You do not need liquid cooling. Ten case fans will keep them within operating spec for 5 years. Undervolt by 12%. You lose 3% performance. You cut power draw by 38%. This is not a tradeoff. It is free.
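The arithmetic behind that claim is easy to check. This is a rough sketch using the 3% and 38% figures from the paragraph above. The 180 W stock draw per card is a placeholder assumption, so swap in the number you actually measure.

```python
def perf_per_watt_gain(perf_loss: float, power_cut: float) -> float:
    """Ratio of undervolted to stock performance per watt."""
    return (1.0 - perf_loss) / (1.0 - power_cut)


def total_draw_watts(cards: int, stock_watts: float, power_cut: float) -> float:
    """Combined draw of all cards after the undervolt."""
    return cards * stock_watts * (1.0 - power_cut)


# 3% performance loss and 38% power cut come from the text above.
gain = perf_per_watt_gain(perf_loss=0.03, power_cut=0.38)

# stock_watts=180.0 is an assumed placeholder, not a measured value.
draw = total_draw_watts(cards=4, stock_watts=180.0, power_cut=0.38)

print(f"perf/watt gain: {gain:.2f}x")
print(f"four-card draw: {draw:.0f} W")
```

With those inputs the efficiency gain works out to about 1.56x, which is why the undervolt reads as free.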

NVIDIA still has the best training stack. For inference, NVIDIA is no longer the default choice. Stop acting like it is.

1.58 bit quantization stopped being a meme

Six months ago 1.58 bit ternary quantization was a weird research demo that looked good on benchmarks but fell apart the second you asked it to do anything real. That changed.

BitCPM-CANN is the result that breaks this. They trained models from scratch in ternary. Not quantized after training. They ran the entire pre-training run at 1.58 bits.

The numbers are unambiguous. The 8B parameter model retains 97.2% of full precision performance across 11 benchmarks. The 3B variant hits parity on BBH and recovers almost all GSM8K performance. Training overhead is 4.5%. That is noise.

This is not a trick. Done correctly you get 6x memory reduction for 3% quality loss. That is the best deal this field has ever gotten.
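If "1.58 bit" sounds arbitrary, it is just log2(3): a ternary weight takes one of three values. Here is a minimal sketch of how ternary weights can be stored, assuming a simple base-3 packing of five weights per byte. This is an illustration only, not the storage format BitCPM-CANN or any runtime actually uses, and the function names are made up for the example.

```python
import math

# A ternary weight is one of {-1, 0, +1}, so it carries log2(3) bits.
BITS_PER_TRIT = math.log2(3)  # about 1.585


def pack_trits(trits):
    """Pack ternary weights five to a byte (3**5 = 243 fits in 0..255)."""
    out = bytearray()
    for i in range(0, len(trits), 5):
        value = 0
        # Build a base-3 number; the first weight ends up least significant.
        for t in reversed(trits[i:i + 5]):
            value = value * 3 + (t + 1)
        out.append(value)
    return bytes(out)


def unpack_trits(data, n):
    """Inverse of pack_trits; n is the original number of weights."""
    trits = []
    for byte in data:
        for _ in range(5):
            trits.append(byte % 3 - 1)
            byte //= 3
    return trits[:n]


def weights_gb(params: float, bits_per_weight: float) -> float:
    """Weights-only footprint in GB (10**9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


weights = [-1, 0, 1, 1, 0, -1, -1]
packed = pack_trits(weights)
restored = unpack_trits(packed, len(weights))

fp16_gb = weights_gb(8e9, 16)      # 8B parameters at 16 bits
ternary_gb = weights_gb(8e9, 1.6)  # 5 trits per byte = 1.6 bits per weight
```

For an 8B model this gives 16 GB against 1.6 GB for the weights alone, a 10x ceiling. The 6x figure quoted above is lower, presumably because not every tensor in a real model is ternary. That reading is an assumption, not something the source states.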

Most importantly: this works on Ascend NPUs. This is the first serious large scale LLM training result that runs entirely outside the CUDA ecosystem. That is the actual news here that no one is talking about. CUDA lock in is not breaking because someone wrote a good kernel. It is breaking because people are now training production grade low bit models natively on other hardware.

By the end of this year every new model released will be trained natively at 1.58 or 2 bit. Full precision training will be considered a waste of compute for all but the very largest base models. Everyone will look back at 4 bit quantization the same way we look at 16 bit inference now.

The actual good local models you should be running right now

Stop testing every model that gets posted. Right now there are exactly two production grade local models that are worth your default consideration.

Qwen3.6 35B A3B is the best general purpose model. It has good reasoning, good tool use, good long context. The 200k context works. It will not fall over. The community fine tunes are stable. The Genesis APEX quant posted last week will run five concurrent 200k context sessions on 24GB VRAM. No OOM. No degradation. No hallucinated tool calls. At 78 t/s. That was impossible 6 months ago.

Gemma4 26B A4B is the second one. It is slightly worse on most benchmarks. It runs 40% faster. If you are running on AMD it is not even close.

You do not need 70B models any more. You have not needed 70B models for 3 months. Stop running them out of habit.

One last note on uncensored models. They are not just for roleplay. They are also not magic. Most of the popular uncensored fine tunes have silent degradation on general reasoning that no one publishes benchmarks for. You will not notice this for three weeks. Then you will get a wrong answer on something trivial and you will not trace it back to the fine tune.

If you are building internal tools, base models are almost always better. You can override alignment guardrails with one line in the system prompt 98% of the time. That is always better than breaking the model.

Native tooling killed the wrapper stack

For two years everyone was building agent frameworks. LangChain. LlamaIndex. 17 different wrapper layers. All of them are obsolete.

llama.cpp server now has native built-in tools. Read file. Write file. Edit file. Grep. Run shell commands. Get time.

You enable them with one flag. No wrappers. No middleware. No extra dependencies. The model will call them correctly. There is no glue code.

This is not experimental. This works right now on the main branch.

Everyone spent three years building 10000 line libraries to do this one thing. The llama.cpp maintainers just added it to the server binary and did not even announce it properly. It just showed up in a commit.

There is no reason to run any other inference server for local workloads right now. None. Every other one is slower, has more features you do not need, and breaks more often.

There is no sandbox yet. Do not expose this to the internet. Do not run it on anything you do not own. On your workstation, or on an internal server that only your team can reach, this is perfect.
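One cheap guard is to refuse to launch if the bind address is anything other than loopback or a private range. This is a sketch using only Python's standard `ipaddress` module. The `is_safe_bind` helper is hypothetical glue for your own launch script, not part of llama.cpp, and a private address is only as safe as the network it sits on.

```python
import ipaddress


def is_safe_bind(host: str) -> bool:
    """True only for loopback or private addresses.

    Wildcard binds (0.0.0.0 or ::) listen on every interface,
    so they are rejected explicitly.
    """
    addr = ipaddress.ip_address(host)
    if addr.is_unspecified:
        return False
    return addr.is_loopback or addr.is_private


for candidate in ("127.0.0.1", "10.0.0.5", "0.0.0.0", "8.8.8.8"):
    verdict = "ok" if is_safe_bind(candidate) else "refuse"
    print(f"{candidate}: {verdict}")
```

The explicit wildcard check matters: Python classifies `0.0.0.0` as private, so without it the riskiest bind would pass.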

This is the pattern that will win. All the agent frameworks will die. Good inference runtimes will absorb all the functionality. Everything else is unnecessary overhead.

Things everyone still gets wrong

Almost all common advice about local LLMs is 12 months out of date.

You do not need ECC memory for inference. No one running production local deployments uses ECC. It does not matter. Memory errors are extremely rare, and on inference workloads their effect on output goes unnoticed.

You do not need PCIe 4.0. For 4 bit and lower quantization you will not measure a difference above PCIe 3.0 x8. Stop paying the premium.
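A back-of-envelope check shows why the bus rarely matters once the weights are loaded. The sketch below assumes a layer-split setup where only one hidden-state vector crosses between cards per generated token. The hidden size of 4096 and 2 bytes per value are illustrative assumptions, and link latency and protocol overhead are ignored.

```python
# Raw per-lane rate in GT/s. PCIe 3.0 and 4.0 both use 128b/130b encoding.
GT_PER_LANE = {3: 8.0, 4: 16.0}


def pcie_gb_per_s(gen: int, lanes: int) -> float:
    """Theoretical one-direction bandwidth in GB/s (10**9 bytes)."""
    return GT_PER_LANE[gen] * (128 / 130) / 8 * lanes


def handoff_microseconds(hidden_size: int, bytes_per_value: int,
                         gen: int, lanes: int) -> float:
    """Time to move one hidden-state vector across the link."""
    payload_bytes = hidden_size * bytes_per_value
    return payload_bytes / (pcie_gb_per_s(gen, lanes) * 1e9) * 1e6


gen3_x8 = pcie_gb_per_s(3, 8)
gen4_x8 = pcie_gb_per_s(4, 8)

# hidden_size=4096 and 2 bytes per value are assumptions for illustration.
per_token_us = handoff_microseconds(4096, 2, gen=3, lanes=8)

print(f"PCIe 3.0 x8: {gen3_x8:.2f} GB/s")
print(f"PCIe 4.0 x8: {gen4_x8:.2f} GB/s")
print(f"per-token handoff on 3.0 x8: {per_token_us:.2f} us")
```

Under those assumptions the per-token transfer is on the order of a microsecond, far below the time a card spends computing a token. Model load time is the one place the faster link shows up.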

Repeat penalty should always be 1.0. All the preset values from 2024 were workarounds for bad quantization artifacts that do not exist any more.

Temperature 0.7 is not the default for everything. Qwen3.6 works best at 0.7. Gemma4 works best at 0.6. Every model is different. Stop copying settings from old guides.
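A small per-model table beats one copied preset. This sketch builds a request body for the llama.cpp server `/completion` endpoint, using the temperatures quoted above and a repeat penalty of 1.0. The dictionary keys and the `completion_payload` helper are hypothetical names chosen for the example, so check the field names against the server version you run.

```python
import json

# Temperatures taken from the text above; the keys are illustrative labels.
SAMPLING = {
    "qwen3.6": {"temperature": 0.7},
    "gemma4": {"temperature": 0.6},
}


def completion_payload(model_key: str, prompt: str, n_predict: int = 256) -> dict:
    """Build a /completion request body with per-model sampling settings."""
    params = SAMPLING[model_key]
    return {
        "prompt": prompt,
        "n_predict": n_predict,
        "temperature": params["temperature"],
        "repeat_penalty": 1.0,  # disabled, as recommended above
    }


payload = completion_payload("gemma4", "Summarize this log file.")
print(json.dumps(payload, indent=2))
```

An unknown model key raises `KeyError` on purpose, so a new model cannot silently inherit another model's settings.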

Most importantly: stop optimizing for tokens per second on empty context. No one cares how fast the model outputs the first 10 tokens. Care how fast it outputs tokens at 150k context. That is the number that actually matters for real workloads. That is the number almost no one benchmarks.
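Measuring this does not take much. The sketch below times any token stream and separates time to first token (prompt processing) from decode speed, so you can feed it a 150k-token prompt and read off the number that matters. `measure_stream` is a hypothetical helper, and how you obtain the token iterator depends on your client. The clock is injectable so the function can be tested without a model.

```python
import time
from typing import Callable, Iterable, Tuple


def measure_stream(tokens: Iterable,
                   clock: Callable[[], float] = time.perf_counter
                   ) -> Tuple[float, float]:
    """Return (seconds to first token, decode tokens per second).

    Decode speed is measured from the first token to the last,
    so prompt processing time is excluded from it.
    """
    start = clock()
    first = last = None
    count = 0
    for _ in tokens:
        now = clock()
        if first is None:
            first = now
        last = now
        count += 1
    if count == 0:
        raise ValueError("stream produced no tokens")
    time_to_first = first - start
    if count > 1 and last > first:
        decode_tps = (count - 1) / (last - first)
    else:
        decode_tps = 0.0  # not enough tokens to measure decode speed
    return time_to_first, decode_tps
```

Run it once with an empty context and once with a full one, against the same model and settings. The gap between the two decode numbers is the figure worth publishing.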

Closing

Local LLMs are no longer a hobby. They are no longer something you mess around with on a spare GPU at home. They are production ready.

They are cheaper. They are predictable. You know exactly what code is running. You know exactly what data leaves your network. You do not have to negotiate API rate limits. You do not get random silent model updates that break everything you built.

None of this is perfect. There are still sharp edges. There is still bad documentation. There is still a mountain of garbage posted every day that you have to filter through.

But the line has been crossed. For most teams building internal tools, for most use cases, running local is now the better choice. Not the cheaper choice. Not the privacy choice. The better choice.

That is the shift that happened in the last six months. Almost no one outside this small corner of the internet has noticed yet.