The takeaway
All of the popular local LLM wrappers are slowing you down. A lot.
Ollama runs 37-72% slower than raw llama.cpp on the same hardware in the tests below. LM Studio adds 170-2300 ms of latency to every first token. And for the last 12 months almost nobody was measuring this properly: everyone was benchmarking their own tool with tuned flags against competitors left on default configurations.
That changed last week. We now have fair, reproducible cross-runtime benchmarks. And llama.cpp itself has closed the entire performance gap with vLLM on multi-GPU setups. You do not need to run a cloud stack at home to get good numbers.
The overhead no one measures
Every local LLM tool you have used sits on top of llama.cpp. None of them tell you how much speed they throw away to add their UI, model management, and API layers.
Until LlamaStash published their benchmark suite, there was no standard test that ran every tool with identical flags, identical model bytes, identical workloads, on the same hardware. Every comparison you saw was rigged.
The benchmark suite runs three test groups:
- Wrapper overhead: does the launcher add any cost over raw llama-server?
- Cross-tool comparison: default, out-of-the-box performance as an end user sees it
- Proxy overhead: cost of going through an OpenAI-compatible shim
All test code is committed to the repository. You can run the exact same harness on your hardware in three commands.
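To make the two reported metrics concrete, here is a minimal sketch of how time to first token and decode throughput can be derived from the arrival times of streamed tokens. This is not the LlamaStash harness; `RunMetrics` and `summarize_run` are hypothetical names, and the sketch assumes you have already recorded one monotonic timestamp per streamed token.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class RunMetrics:
    ttft_ms: float       # time to first token, in milliseconds
    decode_tok_s: float  # decode throughput, in tokens per second


def summarize_run(request_sent: float, token_times: Sequence[float]) -> RunMetrics:
    """Reduce one streamed completion to TTFT and decode throughput.

    request_sent: monotonic clock reading (seconds) when the request was sent.
    token_times:  monotonic clock reading (seconds) for each streamed token.
    """
    if len(token_times) < 2:
        raise ValueError("need at least two tokens to measure decode speed")
    ttft_ms = (token_times[0] - request_sent) * 1000.0
    # Decode speed excludes the first token, whose latency is dominated by prefill.
    decode_window = token_times[-1] - token_times[0]
    if decode_window <= 0:
        raise ValueError("token timestamps must be increasing")
    decode_tok_s = (len(token_times) - 1) / decode_window
    return RunMetrics(ttft_ms=ttft_ms, decode_tok_s=decode_tok_s)
```

Keeping the first token out of the decode window matters: otherwise a slow prefill drags down the decode number and the two metrics stop being independent.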
Benchmark results: raw vs wrappers
All numbers below are for the Ryzen AI Max+ 395 (Strix Halo) APU, running the same GGUF files across all tools. Each cell is decode throughput in tokens per second / time to first token (TTFT) in milliseconds.
| Tool | Gemma 4 E2B | Gemma 4 31B | Qwen3.6 27B | Qwen3.6 35B MoE |
|---|---|---|---|---|
| raw llama-server | 81.0 / 51 | 9.9 / 466 | 7.5 / 406 | 43.1 / 185 |
| LlamaStash | 82.1 / 51 | 9.9 / 468 | 7.5 / 406 | 42.3 / 178 |
| Ollama 0.24.0 | 50.8 / 224 | 4.8 / 1096 | 2.6 / 1750 | 12.2 / 484 |
| LM Studio 2.18.0 | 91.1 / 187 | crashed | crashed | crashed |
LlamaStash lands within 2% of raw llama-server on every model, sometimes slightly above and sometimes slightly below, which is consistent with run-to-run noise. That is by design. The wrapper does nothing but fork the upstream llama-server binary and pass file descriptors. There is no intermediate processing. There is no patched llama.cpp fork.
Ollama loses between 37% and 72% of throughput across these tests. On the 27B dense model it runs at about one third the speed of raw llama.cpp, and it adds more than 1.3 seconds before the first token arrives (1750 ms versus 406 ms). Even on the smallest model the added delay is about 170 ms. For interactive chat this delay is the entire difference between something that feels responsive and something that feels slow.
LM Studio cannot load any model larger than 1.6B on this hardware at all. Its bundled ROCm 6.4 runtime aborts on startup for gfx1151 hardware. The system ROCm 7.2.3 runs the exact same models without issue.
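The percentages above can be recomputed directly from the table. The sketch below copies the raw llama-server and Ollama rows and derives the throughput loss and added TTFT per model; the `overhead` helper is a hypothetical name used only for illustration.

```python
# (decode tok/s, TTFT ms) copied from the results table above.
RAW = {
    "Gemma 4 E2B": (81.0, 51),
    "Gemma 4 31B": (9.9, 466),
    "Qwen3.6 27B": (7.5, 406),
    "Qwen3.6 35B MoE": (43.1, 185),
}
OLLAMA = {
    "Gemma 4 E2B": (50.8, 224),
    "Gemma 4 31B": (4.8, 1096),
    "Qwen3.6 27B": (2.6, 1750),
    "Qwen3.6 35B MoE": (12.2, 484),
}


def overhead(baseline, wrapped):
    """Return {model: (throughput loss in %, added TTFT in ms)}."""
    result = {}
    for model, (base_tps, base_ttft) in baseline.items():
        tps, ttft = wrapped[model]
        loss_pct = (1.0 - tps / base_tps) * 100.0
        result[model] = (round(loss_pct, 1), ttft - base_ttft)
    return result


for model, (loss, added) in overhead(RAW, OLLAMA).items():
    print(f"{model}: {loss}% slower, +{added} ms TTFT")
```

On these figures the losses come out at 37.3%, 51.5%, 65.3%, and 71.7%, with added first-token latency of 173, 630, 1344, and 299 ms respectively.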
Proxy overhead is not real
One of the most persistent myths in this space is that putting an HTTP proxy in front of llama-server will cost you meaningful performance.
This is not true. LlamaStash measured the delta between hitting llama-server directly and hitting its OpenAI compatible proxy:
| Platform | TTFT delta | Decode delta |
|---|---|---|
| AMD APU | +0.45 ms | 0% |
| Apple M1 | -0.6 ms | 0% |
| NVIDIA | +0.57 ms | -0.57% |
All deltas are sub-millisecond. You will never notice this. The overhead of establishing a TCP connection on loopback is smaller than the timing noise of the model itself.
Any extra latency a local LLM API adds over raw llama-server is not inherent to running a proxy. It is bad code in the proxy implementation.
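If you want to check a proxy of your own, one reasonable approach is to compare median TTFT with and without the proxy, and to judge the delta against the spread of the direct runs. The sketch below assumes you have already collected TTFT samples in milliseconds for both paths; `proxy_delta` and `is_within_noise` are hypothetical helper names, and this is not necessarily how LlamaStash computes its deltas.

```python
from statistics import median


def proxy_delta(direct_ms, proxied_ms):
    """Median TTFT delta (proxied minus direct), in milliseconds."""
    return median(proxied_ms) - median(direct_ms)


def is_within_noise(direct_ms, proxied_ms):
    """True when the delta is no larger than the spread of the direct runs."""
    spread = max(direct_ms) - min(direct_ms)
    return abs(proxy_delta(direct_ms, proxied_ms)) <= spread
```

Medians are used rather than means so that a single stalled request does not dominate the comparison. A negative delta, like the Apple M1 row above, is what you would expect to see occasionally when the true overhead is smaller than the noise.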
llama.cpp just caught up to vLLM
For almost two years the received wisdom was that if you wanted good multi-GPU performance you had to run vLLM. That is no longer true.
llama.cpp build b9455 added proper tensor parallel splitting. On a dual 3090 setup running Qwen3.6 27B UD-Q8_K_XL:
- Before b9455: 30-50 tok/s
- After b9455: 70+ tok/s
This is in line with the throughput people were reporting for vLLM on the same hardware. And it works with standard GGUF files: no custom model formats, no Python runtime, no 10GB of dependencies.
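Prefill performance also improved dramatically. Cold prefill of 27k tokens runs at 1417 tok/s. Cached prefill runs above 1100 tok/s. With the prompt cache warm, this is fast enough that you will barely wait for prompt processing even on 128k context turns. A fully cold 128k prompt would still take about a minute and a half at that rate.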
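To put the prefill rate in wall-clock terms, the arithmetic is simply prompt length divided by prefill speed. The sketch below makes two assumptions that are not in the benchmark: that "27k" and "128k" mean 27,000 and 128,000 tokens, and that the cold rate stays constant as the prompt grows (in practice it usually falls with context length, so treat the result as a lower bound on the wait).

```python
def prefill_seconds(prompt_tokens: int, prefill_tok_s: float) -> float:
    """Time to process a prompt at a fixed prefill rate."""
    return prompt_tokens / prefill_tok_s


COLD_RATE = 1417.0  # tok/s, the cold prefill figure quoted above

# Assumed token counts for "27k" and "128k".
print(round(prefill_seconds(27_000, COLD_RATE), 1))   # about 19 seconds
print(round(prefill_seconds(128_000, COLD_RATE), 1))  # about 90 seconds
```

This is why prompt caching matters for long-context turns: only the uncached tail of the prompt pays this cost.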
MTP (multi-token prediction) speculative decoding adds another 50-100% throughput on code and structured output with zero accuracy loss. When the drafted tokens are accepted at a high rate, you will see sustained decode above 80 tok/s on a 27B model.
You can build this for £200
You do not need dual 3090s. You do not need an RTX 5090.
One user put a secondhand Tesla V100 SXM2 into a standard gaming PC for a total cost of £200. That is £150 for the GPU, £50 for the SXM2 to PCIe adapter.
This card from 2017 has 16GB of HBM2 memory with 900 GB/s of bandwidth. That is about 26% more memory bandwidth than an RTX 4080 (717 GB/s). It beats every Apple Silicon Mac ever made on that metric.
When paired with an existing RTX 4080 this gives 32GB total VRAM. llama.cpp splits the model across both cards. The end result:
- Qwen3.6 27B Q5_K_M
- 128k context
- 32 tok/s sustained decode
- 160 tok/s prefill
This is faster than most cloud API endpoints. And this model scores within 5% of Claude Sonnet 4.6 on agent benchmarks.
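Memory bandwidth matters because decode on a dense model is mostly limited by how fast the weights can be streamed from VRAM. A rough ceiling for a layer-split setup can be sketched as below. Everything beyond the 900 GB/s figure is an assumption: the 716.8 GB/s value is the RTX 4080's published bandwidth, the 19 GB weight size for Qwen3.6 27B Q5_K_M is an estimate, the even split across cards is a guess, and `decode_ceiling_tok_s` is a hypothetical helper. The model ignores KV cache reads, compute, and inter-card transfers.

```python
def decode_ceiling_tok_s(shards):
    """Upper bound on decode tok/s for a layer-split dense model.

    shards: list of (weight_gigabytes_on_card, card_bandwidth_gb_per_s).
    Each token streams every shard's weights once, one card after another,
    so the per-token times add up.
    """
    seconds_per_token = sum(gb / bw for gb, bw in shards)
    return 1.0 / seconds_per_token


# ASSUMPTION: ~19 GB of weights, split evenly between the two cards.
v100 = (9.5, 900.0)      # V100 SXM2 bandwidth from the text
rtx4080 = (9.5, 716.8)   # RTX 4080 bandwidth from NVIDIA's spec sheet
print(f"{decode_ceiling_tok_s([v100, rtx4080]):.0f} tok/s ceiling")
```

Under these assumptions the ceiling lands at roughly 42 tok/s, so the reported 32 tok/s is plausibly close to what the memory systems allow.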
The only catch was the fan. The stock adapter fan ran at 82 decibels. Rewiring it to run off a motherboard PWM header brought noise down to normal desktop levels for £2 worth of jumper cables.
The quiet revolution no one is covering
We have crossed an invisible threshold.
Right now you can go on eBay, spend £200 on a retired datacenter GPU, add it to a gaming PC, and run a model competitive with the best commercial closed models, at interactive speeds, completely offline, forever. No subscriptions. No rate limits. No terms of service.
Almost nobody is talking about this. All of the attention goes to new model releases, to cloud APIs, to wrapper tools that add features and remove performance.
The actual progress is happening one commit at a time in llama.cpp. It is happening on eBay listings for obsolete datacenter hardware. It is happening in people's bedrooms, not in startup press releases.
What you should actually run today
If you care about performance this is the current stack:
- Run raw llama.cpp build b9455 or newer. Do not run older versions.
- Use LlamaStash as your launcher if you want model management and an OpenAI API. It is the only wrapper that does not throw away performance.
- Use Q4_K_M for daily use, Q8_K_XL if you have the VRAM.
- Enable MTP speculative decoding for code generation.
- If you need more VRAM, buy a secondhand V100 before you buy any new consumer GPU.
Do not run Ollama unless you specifically need one of its features and are willing to accept roughly 40-70% lower throughput. Do not run LM Studio on AMD hardware right now.
Open problems
This is not a finished story. There are still very real gaps:
- LM Studio and Ollama have still not published any reproducible benchmarks justifying their overhead.
- ROCm support for new AMD hardware is still broken in every bundled runtime.
- Multi-GPU tensor split still does not balance properly across cards of different performance classes.
- NPU acceleration is still effectively useless for general purpose inference across every vendor.
- No one has yet built a good zero-overhead wrapper that also works well over LAN.
All of these are solvable problems. None of them require new AI research. They just require engineers to stop building marketing features and start measuring what actually matters.
Closing
Performance is not a nice-to-have. It is the thing that determines whether you will actually use a local LLM, or just install it once and forget about it.
For three years everyone in this space accepted that wrapper overhead was unavoidable. It was not. It was just bad code.
We now have tools that do not throw away performance. We have hardware that costs less than a single month of cloud API credits. The only thing left is for people to stop using the slow ones.