
Local AI in 2026: What Actually Runs, What Doesn't, and What's Worth Building

#local-ai #on-device-inference #llama.cpp #hardware-benchmarks #agent-workflows

Local AI stopped being a curiosity sometime in the last year. The models are small enough, the hardware is fast enough, and the tooling is good enough that running inference on your own machine is now the rational choice for a growing class of tasks. Not all tasks. Not most tasks. But enough that every ML engineer should have a working mental model of what local deployment actually looks like in 2026: which hardware matters, which backends win, and how to build real workflows instead of demo toys.

What follows is a synthesis of benchmarks, setup reports, and production workflows from engineers who are already running locally. The data is specific. The conclusions are opinionated.

Memory bandwidth is the only hardware metric that matters

A Reddit user spent three days running standardized tests across an RTX 6000, an M5 MacBook Pro, an NVIDIA DGX Spark, and an AMD Strix Halo system. The results confirmed what anyone who has profiled a transformer already knows: token generation speed tracks memory bandwidth with almost no deviation.

The RTX 6000 has roughly 1,800 GB/s of memory bandwidth. The M5 sits around 600 GB/s. The DGX Spark and Strix Halo both hover near 256 GB/s. Tokens per second follow that curve directly. No surprise: autoregressive decoding is a memory-bound workload. For a dense model, every generated token requires streaming essentially all of the weights out of memory, so compute sits idle waiting for data.
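The bandwidth argument reduces to one division. Here is a minimal sketch, assuming a dense model whose weights are read once per generated token and ignoring KV-cache reads and kernel overhead. The bandwidth figures are the approximate ones quoted above; the 14 GB weight size is a placeholder for illustration, not a measurement.

```python
# Rough upper bound on single-stream decode speed for a memory-bound model.
# Assumption: every generated token streams all weights from memory once.

def decode_ceiling_tok_s(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Bandwidth-limited ceiling in tokens per second."""
    return bandwidth_gb_s / weights_gb

# Approximate bandwidth figures quoted in the text (GB/s).
BANDWIDTH_GB_S = {
    "RTX 6000": 1800.0,
    "M5": 600.0,
    "DGX Spark": 256.0,
    "Strix Halo": 256.0,
}

WEIGHTS_GB = 14.0  # hypothetical quantized model size, for illustration only

ceilings = {
    name: decode_ceiling_tok_s(bw, WEIGHTS_GB)
    for name, bw in BANDWIDTH_GB_S.items()
}

for name, tok_s in ceilings.items():
    print(f"{name:>10}: at most ~{tok_s:.0f} tok/s")
```

Real throughput lands below these ceilings. The ratios between machines are the part that should carry over.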

The practical takeaway: if you are buying hardware for local inference, prioritize memory bandwidth over FLOPS. A GPU with 1,800 GB/s and 24 GB of VRAM will outperform a unified memory system with 600 GB/s and 128 GB of memory for single-stream generation, provided the model and its KV cache fit in the 24 GB. Most tasks do not need 128 GB of capacity. They need weights read fast.

The M5 result is worth noting on its own. For the price point, and assuming you are not locked into CUDA, the maxed-out M5 genuinely outperforms the DGX Spark: more than twice the memory bandwidth (roughly 600 GB/s against 256 GB/s) with the same total unified memory. The MacBook also held up thermally over multi-day runs, cruising around 80°C. But the "quiet Mac" narrative needs correcting: under sustained AI workloads, an M5 MacBook Pro sounds like a gaming laptop. It moves serious air. Anyone telling you these machines are silent under load has not run a 27B model for three hours straight.

If you want to check whether your existing hardware can handle a specific model, canirun.ai is a straightforward resource that maps model requirements against your specs.

Your backend choice is worth 2x performance on the same hardware

The single most impactful optimization you can make for local inference has nothing to do with hardware. It is your inference backend.

A detailed benchmark on an RTX 3090 (24 GB VRAM) running Qwen 3.6 27B tells the story clearly. The test used a real workload: a ~5,900 token prompt with 1,024 tokens of output generation. This is not a synthetic best-case benchmark. It is a code review task over local setup files.

Results with the IQ4_KS quantization of Qwen 3.6 27B:

  • ik_llama.cpp: 1,261 tok/s prefill, 72.9 tok/s decode
  • llama.cpp (upstream): slower on both metrics, though still usable as a baseline
  • beellama.cpp: promising on paper but could not reproduce expected speeds on this setup
  • vLLM: saw ~78 tok/s on responses but hit OOM cliffs on high-context runs, flagged as unresolved for single-card long-context

The winning configuration used ik_llama.cpp build 4507 with these flags: --ctx-size 156000, --cache-type-k q8_0, --cache-type-v q8_0, --flash-attn on, --multi-token-prediction, --draft-max 4, --draft-p-min 0.0, --merge-qkv, --merge-up-gate-experts, --cache-ram 32768.
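To make that flag list easier to reuse, here is a small sketch that assembles it into a command line. The flags and values are copied from the list above. The binary name `llama-server`, the `--model` argument, and the model path are assumptions for illustration and are not taken from the benchmark, so check them against your own build.

```python
import shlex
from typing import List, Optional

# Flags as listed in the text; None marks a flag that takes no value.
FLAGS = [
    ("--ctx-size", "156000"),
    ("--cache-type-k", "q8_0"),
    ("--cache-type-v", "q8_0"),
    ("--flash-attn", "on"),
    ("--multi-token-prediction", None),
    ("--draft-max", "4"),
    ("--draft-p-min", "0.0"),
    ("--merge-qkv", None),
    ("--merge-up-gate-experts", None),
    ("--cache-ram", "32768"),
]

def build_command(binary: str, model_path: str,
                  ctx_size: Optional[int] = None) -> List[str]:
    """Return an argv list; ctx_size overrides --ctx-size for smaller cards."""
    cmd = [binary, "--model", model_path]
    for flag, value in FLAGS:
        if flag == "--ctx-size" and ctx_size is not None:
            value = str(ctx_size)
        cmd.append(flag)
        if value is not None:
            cmd.append(value)
    return cmd

# Binary name and model path are placeholders, not part of the benchmark.
cmd = build_command("llama-server", "models/example-27b-iq4_ks.gguf")
print(shlex.join(cmd))
```

Keeping the flags in one list makes it easy to run the same configuration with a different context size or against a different build.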

That is 156K context on a single 24 GB card. The IQ4_KS quantization, KV cache at q8_0, flash attention, multi-token prediction, and the QKV/up-gate-expert merges all contribute to fitting the model and running it fast. This is not a trivial configuration. You need to understand what each flag does. But the payoff is a 27B model with 156K context running at nearly 73 tok/s decode on consumer hardware.
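Much of the fitting problem is KV cache arithmetic. The sketch below estimates cache size for a plain transformer where every layer keeps a full K and V entry per token. The layer count, KV head count, and head dimension are hypothetical placeholders, not the real architecture of the benchmarked model. The q8_0 cost of 34 bytes per 32 values comes from its block layout (32 int8 values plus one fp16 scale).

```python
# KV cache size estimate for a standard attention stack.
# Assumption: every layer stores full K and V for every token in context.

F16_BYTES = 2.0
Q8_0_BYTES = 34 / 32  # q8_0 block: 32 int8 values + one fp16 scale

def kv_cache_bytes(n_layers: int, n_kv_heads: int, head_dim: int,
                   ctx_tokens: int, bytes_per_elem: float) -> float:
    """Bytes needed to hold K and V for ctx_tokens tokens."""
    # Factor of 2 covers the separate K and V tensors.
    return 2 * n_layers * n_kv_heads * head_dim * ctx_tokens * bytes_per_elem

# Hypothetical architecture, for illustration only.
N_LAYERS, N_KV_HEADS, HEAD_DIM = 48, 4, 128
CTX = 156_000  # context size from the configuration in the text

for label, width in (("f16", F16_BYTES), ("q8_0", Q8_0_BYTES)):
    size = kv_cache_bytes(N_LAYERS, N_KV_HEADS, HEAD_DIM, CTX, width)
    print(f"{label:>5}: {size / 2**30:.2f} GiB")
```

Under these assumptions, q8_0 cuts the cache to a little over half its f16 size. Whether the result fits next to the weights depends on the model's actual architecture.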

ik_llama.cpp is a fork of llama.cpp maintained by ikawrakow that focuses on performance optimizations for specific hardware targets. It is not a drop-in replacement in every case, and build versions matter. But for single-GPU setups where you are trying to maximize what fits in VRAM, it is currently the best option.

The broader point: switching from stock llama.cpp to ik_llama.cpp on the same hardware with the same model can give you meaningful speed improvements and better VRAM utilization. If you have not tested your workload across backends, you are leaving performance on the table.
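When comparing backends, it helps to compute prefill and decode speed the same way for every run. A minimal sketch, assuming you can record three timestamps per request (sent, first token received, finished). The `RunTiming` structure and its field names are hypothetical, and the numbers in the example are made up to show the arithmetic, not measured.

```python
from dataclasses import dataclass

@dataclass
class RunTiming:
    prompt_tokens: int
    output_tokens: int
    t_request: float      # seconds: request sent
    t_first_token: float  # seconds: first output token received
    t_done: float         # seconds: last output token received

def prefill_tok_s(run: RunTiming) -> float:
    """Prompt tokens processed per second before the first output token."""
    return run.prompt_tokens / (run.t_first_token - run.t_request)

def decode_tok_s(run: RunTiming) -> float:
    """Output tokens per second after the first token has arrived."""
    # The first token marks the end of prefill, so only the remaining
    # output_tokens - 1 tokens were produced during the decode interval.
    return (run.output_tokens - 1) / (run.t_done - run.t_first_token)

# Made-up timings chosen so the arithmetic is easy to follow.
example = RunTiming(prompt_tokens=1000, output_tokens=101,
                    t_request=0.0, t_first_token=2.0, t_done=12.0)
print(f"prefill: {prefill_tok_s(example):.1f} tok/s")
print(f"decode:  {decode_tok_s(example):.1f} tok/s")
```

Use the same prompt, output length, and quantization for each backend so the two numbers are comparable.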

A 2015 desktop can run Gemma 4. The question is whether you should.

The "Old PC vs New AI" benchmark asks a question many engineers have: can that old machine in the closet actually run a modern model? The answer is yes, with qualifiers.

Gemma 4 2B and 4B were tested on 2015 desktop hardware. The 2B model runs. The 4B model runs. Neither runs fast by 2026 standards, but both produce usable output. The 2B variant is obviously more comfortable on old silicon.

This matters more than it seems. Not every inference task needs 73 tok/s. If you are running a background process that classifies text, extracts entities, or generates short summaries, a 2B model on old hardware at 5-10 tok/s is perfectly adequate. The latency is acceptable. The cost is zero. The data never leaves the machine.
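A quick calculation shows why 5-10 tok/s is enough for background work. The sketch below counts output tokens only and ignores prompt processing time. The 60-token output length is an assumed figure for a short summary or label set.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def seconds_per_item(output_tokens: int, tok_s: float) -> float:
    """Generation time for one item, ignoring prompt processing."""
    return output_tokens / tok_s

def items_per_day(output_tokens: int, tok_s: float) -> int:
    """Items one machine can finish in a day of continuous decoding."""
    return int(SECONDS_PER_DAY / seconds_per_item(output_tokens, tok_s))

OUTPUT_TOKENS = 60  # assumed length of a short summary or extraction result

for speed in (5.0, 10.0):
    latency = seconds_per_item(OUTPUT_TOKENS, speed)
    daily = items_per_day(OUTPUT_TOKENS, speed)
    print(f"{speed:>4.0f} tok/s: {latency:.0f} s per item, ~{daily} items/day")
```

For a queue that fills over hours rather than seconds, these rates are adequate.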

The mistake is expecting these small models on old hardware to handle general chat or complex reasoning. They are tool models. Use them for narrow, well-defined tasks where the input and output formats are constrained, and they punch well above their weight class.

Browser inference works. It also has real costs.

Running models directly in the browser via WebGPU or WASM is now feasible. The question is what it does to your page performance.

A detailed test measured Core Web Vitals before and after loading browser-based AI models. The results are a useful corrective to the "just ship it in the browser" enthusiasm. Loading and running inference on a model in the browser impacts Largest Contentful Paint, Total Blocking Time, and Cumulative Layout Shift. The exact numbers depend on the model size and the host page, but the direction is consistent: browser inference is expensive.

This is not an argument against browser inference. It is an argument for being deliberate about when you use it. Browser-based sentiment analysis or speech recognition that runs on user input without sending data to a server is a genuine privacy win. But you need to account for the performance budget. Lazy-load the model. Run inference off the main thread. Consider whether a smaller model or a simpler heuristic would serve the same purpose.

The browser is a hostile environment for heavy compute. You share the main thread with rendering, event handling, and garbage collection. You have no control over the user's thermal throttling. You cannot assume WebGPU is available (it is not, on many mobile browsers as of mid-2026). Browser inference is a tool, not a default.

Structured workflows with small models beat cloud for daily development

This is the most interesting pattern in the current local AI space, and the one most likely to change how engineers work day to day.

A Reddit user documented a 28-day journey building a custom agent loop around Qwen 3.5 9B. The progression is instructive:

  1. Started with a basic home-rolled agent loop with a handful of tools. Found it surprisingly effective despite being crude.
  2. Got addicted to improving it, eventually reaching the point where the agent could edit its own code.
  3. Hit the human bottleneck: the agent sat idle waiting for approvals and reviews while the todo list grew.
  4. Hit the model bottleneck: a 9B model has limited intelligence and a small context window. You cannot dump hundreds of files into it and expect coherent processing.
  5. Solved both with structured workflows: map-reduce patterns that break tasks into smaller chunks, run them in parallel, and reduce the results. Structured outputs to reduce LLM variability and make the reduce step deterministic. A database to monitor and track workflow state.
  6. Wrapped the entire pattern into a "skill" so a single instruction creates the workflow with Python guardrails, parallel execution, monitoring, checkpointing, and recovery.
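The state-tracking, checkpointing, and recovery steps in that list can be sketched with a single SQLite table. This is an illustration of the pattern, not the Reddit user's implementation. The table layout, function names, and the `worker` callable are all hypothetical.

```python
import sqlite3
from typing import Callable, Iterable, List

def open_state(path: str = ":memory:") -> sqlite3.Connection:
    """Open (or create) the workflow state database."""
    db = sqlite3.connect(path)
    db.execute(
        "CREATE TABLE IF NOT EXISTS tasks ("
        "task_id TEXT PRIMARY KEY, status TEXT NOT NULL, result TEXT)"
    )
    return db

def register(db: sqlite3.Connection, task_ids: Iterable[str]) -> None:
    """Add tasks as pending; tasks that already exist are left untouched."""
    db.executemany(
        "INSERT OR IGNORE INTO tasks (task_id, status) VALUES (?, 'pending')",
        [(t,) for t in task_ids],
    )
    db.commit()

def pending(db: sqlite3.Connection) -> List[str]:
    """Every task that is not done yet, including failed ones."""
    rows = db.execute(
        "SELECT task_id FROM tasks WHERE status != 'done' ORDER BY task_id"
    ).fetchall()
    return [row[0] for row in rows]

def run_pending(db: sqlite3.Connection, worker: Callable[[str], str]) -> None:
    """Run unfinished tasks, committing after each one as a checkpoint."""
    for task_id in pending(db):
        try:
            result = worker(task_id)
        except Exception as exc:
            db.execute("UPDATE tasks SET status='failed', result=? WHERE task_id=?",
                       (repr(exc), task_id))
        else:
            db.execute("UPDATE tasks SET status='done', result=? WHERE task_id=?",
                       (result, task_id))
        db.commit()
```

Recovery then means calling `run_pending` again. Finished tasks are skipped, and failed or interrupted ones are retried.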

The result: this custom agent replaced Claude Code for 99% of tasks. The 1% is when the agent breaks itself during development.

This is the key insight for local AI in 2026. A 9B model is not smart enough to replace Claude on its own. But a 9B model inside a structured workflow with tool access, parallel execution, checkpointing, and deterministic guardrails can handle the vast majority of daily development tasks. The intelligence is in the workflow, not the weights.

The map-reduce pattern is particularly clever. You cannot give a 9B model a 100-file codebase and ask it to refactor. But you can give it one file at a time, extract structured findings from each, and then run a second pass that synthesizes the results. Each individual call stays within context limits. The overall task exceeds what the model could do in a single shot.
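A sketch of the one-file-at-a-time pattern follows. The model is passed in as a plain `call_model(prompt) -> str` callable, which is a hypothetical stand-in for whatever client talks to your local server. The JSON shape and the prompt wording are likewise assumptions. The reduce step here is ordinary Python, so it is deterministic. A second model pass for synthesis could consume its output.

```python
import json
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List

REQUIRED_KEYS = {"file", "issues"}

def map_one(call_model: Callable[[str], str], path: str, source: str) -> dict:
    """Ask the model about a single file and validate its structured reply."""
    prompt = (
        f"Review {path}. Reply only with JSON like "
        f'{{"file": "<path>", "issues": ["<text>"]}}.\n\n{source}'
    )
    finding = json.loads(call_model(prompt))  # raises if the reply is not JSON
    if (not isinstance(finding, dict)
            or set(finding) != REQUIRED_KEYS
            or not isinstance(finding["issues"], list)):
        raise ValueError(f"malformed finding for {path}: {finding!r}")
    return finding

def map_reduce(call_model: Callable[[str], str], files: Dict[str, str],
               max_workers: int = 4) -> Dict[str, List[str]]:
    """Map over files in parallel, then merge findings without a model call."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        findings = list(
            pool.map(lambda item: map_one(call_model, item[0], item[1]),
                     files.items())
        )
    # Deterministic reduce: keep files with issues, ordered by path.
    ordered = sorted(findings, key=lambda f: f["file"])
    return {f["file"]: f["issues"] for f in ordered if f["issues"]}
```

Each call sees one file, so context limits apply per file rather than per codebase. Validation failures surface as exceptions instead of corrupting the merged result.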

This is also where local inference has a structural advantage over cloud. When you are running 50 parallel agent calls through a map-reduce pipeline, the cost of cloud inference adds up fast. At home, the marginal cost of each additional call is the electricity to keep your GPU running. The economics flip once you move from single-shot prompting to agentic workflows.
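The comparison comes down to two small formulas. Every number in the example below is a placeholder chosen to make the arithmetic visible, not a quoted price or a measured power draw. Substitute your own provider rates, token counts, GPU wattage, and electricity tariff.

```python
def cloud_cost_usd(calls: int, tokens_per_call: int,
                   usd_per_million_tokens: float) -> float:
    """Per-token billing: cost scales linearly with calls and tokens."""
    return calls * tokens_per_call / 1_000_000 * usd_per_million_tokens

def electricity_cost_usd(watts: float, hours: float, usd_per_kwh: float) -> float:
    """Local marginal cost: energy drawn while the job runs."""
    return watts / 1000 * hours * usd_per_kwh

# Placeholder inputs, for illustration only.
cloud = cloud_cost_usd(calls=50, tokens_per_call=4000, usd_per_million_tokens=5.0)
local = electricity_cost_usd(watts=350, hours=0.5, usd_per_kwh=0.30)

print(f"cloud: ${cloud:.2f} per pipeline run")
print(f"local: ${local:.4f} per pipeline run")
```

The cloud figure grows with every additional call. The local figure grows only with wall-clock time, which is why parallel agent workloads shift the balance.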

The fully offline development machine exists today

Deepu Remya documented a fully offline AI-assisted Linux development machine built on an ASUS ROG Flow Z13 running Arch Linux with Niri (a scrollable-tiling Wayland compositor) and local AI coding tools.

This is not a theoretical setup. It is a daily driver. No cloud API calls. No telemetry. No internet dependency for AI features. The models run locally. The code stays local. The data stays local.

The philosophical case for this approach is laid out plainly in the "Local AI needs to be the norm" essay: cost, latency, privacy, and sovereignty. Every prompt you send to a cloud API is data you no longer control. Every API dependency is a point of failure. Every per-token cost is a tax on your iteration speed.

The practical case is that the tooling now supports it. Docker Model Runner lets you point Claude Code at a locally served model instead of Anthropic's API. llama.cpp and its forks run on everything from old desktops to MacBooks to mini PCs. Quantized models at IQ4, Q4_K_M, and similar levels fit 7B-27B parameter models on consumer hardware with acceptable quality loss.
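The fit claim can be checked with a one-line estimate: parameter count times bits per weight. The sketch assumes a flat 4.5 bits per weight as a round stand-in for 4-bit-class quantizations. Real formats vary by quantization type and by tensor, and the KV cache and runtime buffers need room on top of the weights.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate weight size in GB (10^9 bytes) for a quantized model."""
    return params_billion * bits_per_weight / 8

BITS_PER_WEIGHT = 4.5  # assumed round figure for 4-bit-class quantization
VRAM_GB = 24.0         # the consumer card size discussed in the text

for size_b in (7, 14, 27):
    gb = weights_gb(size_b, BITS_PER_WEIGHT)
    headroom = VRAM_GB - gb
    print(f"{size_b:>2}B: ~{gb:.1f} GB of weights, ~{headroom:.1f} GB left on a 24 GB card")
```

The headroom figure is what remains for context, which is why KV cache quantization matters at the 27B end of the range.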

Canirun.ai provides a simple lookup: plug in your hardware, see what models you can run. The Strix Halo mini PC ecosystem (documented in an updated May 2026 chart) gives you a growing list of small-form-factor machines with enough unified memory to run 14B-32B models comfortably.

What I would build

If I were setting up a local AI workstation today, here is what I would do, based on the data above.

For hardware, I would prioritize memory bandwidth above all else. An RTX 3090 or 4090 with 24 GB VRAM gives you the best price-to-bandwidth ratio for consumer hardware. If you need unified memory for large contexts and prefer not to manage GPU/CPU memory splits, a maxed-out M-series Mac is the next best option. The DGX Spark and Strix Halo systems are interesting but currently sit in an awkward middle ground: not enough bandwidth to compete with dedicated GPUs, not enough unified memory advantage over the M5 to justify the ecosystem lock-in.

For software, I would start with ik_llama.cpp and test against stock llama.cpp. The configuration from the Qwen 3.6 27B benchmark is a good template: IQ4_KS quantization, q8_0 KV cache, flash attention, multi-token prediction, and the relevant merge flags. Copy those flags, adjust context size to your VRAM, and benchmark.

For workflows, I would invest time in structured agent patterns before buying bigger hardware. A 9B model with a map-reduce pipeline, structured outputs, and checkpointing will outperform a 70B model used as a chat interface for most development tasks. The agent loop does not need to be sophisticated. It needs to be reliable, parallelizable, and able to recover from failures.

For browser deployment, I would use browser inference only when the privacy benefit is clear and the performance budget allows it. Sentiment analysis on user input before it leaves the device. Speech recognition without a server round-trip. Not general-purpose chat.

The local AI space in 2026 is not about replacing cloud services entirely. It is about having a real choice. For narrow tasks, old hardware works. For daily development, structured workflows with small models work. For maximum single-stream performance, memory bandwidth and backend choice matter more than any other variable. The tools exist. The benchmarks are public. The only remaining question is whether you are willing to configure your own stack instead of reaching for an API key.