Summary
Local LLM inference is no longer a novelty reserved for people with 24GB VRAM cards. Quantized 2B-parameter models run on hardware from a decade ago at usable speeds, tooling now auto-matches models to your specific machine, and private RAG pipelines compress retrieval indices by 97% to fit on laptops. The question has shifted from "can I run this?" to "which model should I run, and how do I wire it into my workflow?"
Background & Context
Two years ago, running an LLM locally meant compiling llama.cpp, guessing quantization levels, and hoping your 8GB GPU could load a 7B model before OOMing. The ecosystem was fragmented: different inference engines, no standard hardware compatibility checks, and zero tooling for figuring out what your machine could actually handle.
Three things changed simultaneously. First, model distillation and quantization got aggressive enough that 2B and 4B models produce useful output for coding, summarization, and retrieval tasks. Second, inference runtimes (llama.cpp, MLX, ONNX Runtime) matured to the point where "install and run" is one command. Third, a crop of tools emerged to bridge the gap between "I have hardware X" and "here is the model that runs best on it."
The practical upshot: engineers are now building fully offline AI workflows. One developer documented a complete Arch Linux setup on an ASUS ROG Flow Z13, running the Niri compositor with local AI coding assistance and no cloud calls at all. Another benchmarked Gemma 4 on a 2015 desktop. The hardware floor is lower than most people assume.
Technical Deep Dive
Hardware floor: what a 2015 desktop can do
The Gemma 4 benchmark on old hardware is instructive because it establishes a concrete baseline. A typical 2015 desktop shipped with an Intel Haswell or Skylake i5/i7, 8-16GB of DDR3 (or early DDR4) RAM, and integrated graphics or a GTX 900-series card. No modern GPU acceleration. Pure CPU inference.
Gemma 4 2B at Q4_K_M quantization requires roughly 1.5GB of RAM for model weights. On a Haswell i5 with dual-channel DDR3-1600, peak memory bandwidth sits around 25 GB/s. The arithmetic for token generation: each generated token requires reading the full model weights once, so decoding is bound by memory bandwidth rather than by compute. At 25 GB/s and 1.5GB of weights, the theoretical ceiling is roughly 16-17 tokens/second before accounting for KV cache overhead and compute latency. Real-world benchmarks show 8-12 tokens/second, which is readable in real time.
Gemma 4 4B at Q4_K_M requires roughly 2.5GB. Same bandwidth gives theoretical ~10 tokens/second, with real-world around 5-8 tokens/second. Slower, but still functional for batch tasks like code completion where you are not watching tokens stream.
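The ceiling arithmetic above can be written down directly. This is a minimal sketch that assumes decoding is purely memory-bandwidth-bound; the function name is made up for illustration, and the inputs are the approximate figures quoted in the text rather than measurements.

```python
def bandwidth_bound_tokens_per_second(bandwidth_gb_s: float, weights_gb: float) -> float:
    """Upper bound on decode speed if every token streams all weights from RAM once."""
    if bandwidth_gb_s <= 0 or weights_gb <= 0:
        raise ValueError("bandwidth and weight size must be positive")
    return bandwidth_gb_s / weights_gb


# Figures quoted in the text: ~25 GB/s of DDR3-1600 bandwidth,
# ~1.5GB and ~2.5GB of Q4_K_M weights.
for name, weights_gb in [("2B Q4_K_M", 1.5), ("4B Q4_K_M", 2.5)]:
    ceiling = bandwidth_bound_tokens_per_second(25.0, weights_gb)
    print(f"{name}: at most ~{ceiling:.1f} tokens/s")
```

The result is a ceiling, not a prediction: KV cache reads and compute latency pull real throughput below it, which is consistent with the measured ranges above.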
The key variable is memory bandwidth, not FLOPS. This is why older hardware performs better than people expect: CPU memory bandwidth has not improved as dramatically as GPU compute over the past decade. A 2015 desktop with dual-channel DDR3-1600 peaks at 25.6 GB/s, the same as a much newer budget laptop running single-channel DDR4-3200.
Model selection tooling: whichllm and canirun.ai
The problem with local inference has never been "can any model run?" It has been "which model should I run on this specific machine?" The naive approach is checking parameter count against available RAM with a 1.5x overhead factor. This fails because it ignores quantization level, context length, KV cache sizing, and the fact that different architectures have different memory-to-quality tradeoff curves.
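To make the point about quantization and context length concrete, here is a rough memory estimate that separates weights from KV cache. It is a sketch under stated assumptions: the model shape (32 layers, 8 KV heads, head dimension 128) is an illustrative 8B-class configuration with grouped-query attention, 4.5 bits per weight is a loose stand-in for a 4-bit quant, and runtime buffers are ignored.

```python
def weights_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate size of the quantized weights in GB (1 GB = 1e9 bytes)."""
    return params_billion * bits_per_weight / 8


def kv_cache_gb(n_layers: int, n_kv_heads: int, head_dim: int,
                context_tokens: int, bytes_per_value: int = 2) -> float:
    """KV cache size: one key and one value vector per layer, per KV head, per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_tokens * bytes_per_value
    return total_bytes / 1e9


# Assumed, illustrative shape; not taken from any specific model card.
model = weights_gb(params_billion=8, bits_per_weight=4.5)
for context in (2048, 8192, 32768):
    cache = kv_cache_gb(n_layers=32, n_kv_heads=8, head_dim=128, context_tokens=context)
    print(f"context {context:>6}: weights {model:.2f} GB + KV cache {cache:.2f} GB")
```

Under these assumptions the KV cache grows from a rounding error at short contexts to roughly the size of the weights at long ones, which a fixed overhead factor cannot capture.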
whichllm takes a different approach. It ranks models by real, recency-aware benchmarks rather than parameter count. The tool runs a single command, profiles your hardware (CPU cores, memory bandwidth, available RAM, GPU VRAM if present), and returns a ranked list of models that will actually run at usable speeds. The recency weighting matters because a Q4_K_M quant of the newer Llama 3.1 8B outperforms a Q5_K_M quant of the older Llama 2 7B on most benchmarks, but sorting by parameter count would place them in the same tier.
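One way to picture recency weighting is an exponential decay applied to a benchmark score. The sketch below is purely illustrative: the half-life formula, the `Entry` type, the model names, and the scores are all hypothetical and are not taken from whichllm's implementation or from any published benchmark.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entry:
    name: str
    benchmark_score: float  # hypothetical aggregate score, higher is better
    age_months: float       # months since the model was released


def recency_weighted(entry: Entry, half_life_months: float = 12.0) -> float:
    """Halve the benchmark score for every `half_life_months` of model age."""
    return entry.benchmark_score * 0.5 ** (entry.age_months / half_life_months)


def rank(entries: list[Entry], half_life_months: float = 12.0) -> list[Entry]:
    """Sort entries from best to worst by recency-weighted score."""
    return sorted(entries, key=lambda e: recency_weighted(e, half_life_months), reverse=True)


# Placeholder entries with made-up scores, for illustration only.
entries = [Entry("older-7b-q5", 60.0, 24.0), Entry("newer-8b-q4", 55.0, 6.0)]
for entry in rank(entries):
    print(f"{entry.name}: weighted score {recency_weighted(entry):.1f}")
```

The half-life is a tuning knob: a long one approaches plain benchmark sorting, a short one favors whatever was released most recently.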
canirun.ai provides a complementary function: a web-based hardware compatibility checker. You input your specs, it tells you what can run. Less granular than whichllm but useful for quick feasibility checks before downloading multi-gigabyte model files.
Full-stack local: DreamServer
DreamServer bundles the entire local AI stack into one package: LLM inference, chat UI, voice input/output, agent workflows, RAG, and image generation. No cloud, no subscriptions. It targets the "I want local AI but do not want to assemble it from parts" use case.
The architecture uses llama.cpp as the inference backend (or MLX on Apple Silicon), with a web-based frontend for chat and workflow configuration. RAG is handled through a local vector store with embedding models running alongside the generative model. Image generation uses Stable Diffusion variants quantized to run on 6-8GB VRAM.
The practical constraint is RAM. Running a 7B chat model (Q4, ~4GB), an embedding model (~500MB), a vector index, and Stable Diffusion (~4GB for SDXL at Q8) simultaneously requires 16GB of system RAM minimum, 24GB for comfortable operation. This is within reach of most developer machines from the last three years.
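The RAM budget can be tallied in a few lines. This is a back-of-the-envelope sketch: the model sizes are the approximate figures from the paragraph above, while the vector index size and the amount reserved for the OS and other applications are assumptions chosen for illustration.

```python
def ram_headroom_gb(components_gb: dict[str, float], total_ram_gb: float,
                    os_reserve_gb: float) -> float:
    """RAM left over after the OS reserve and every stack component are loaded."""
    return total_ram_gb - os_reserve_gb - sum(components_gb.values())


stack = {
    "chat model (7B, Q4)": 4.0,
    "embedding model": 0.5,
    "vector index": 0.5,        # assumed; depends entirely on corpus size
    "image model (SDXL)": 4.0,
}

for total in (16.0, 24.0):
    headroom = ram_headroom_gb(stack, total_ram_gb=total, os_reserve_gb=4.0)  # 4GB reserve is assumed
    print(f"{total:.0f}GB machine: {headroom:.1f}GB headroom")
```

With these numbers a 16GB machine has only a few gigabytes to spare for context and caches, which matches the "minimum" versus "comfortable" distinction above.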
Private RAG: LEANN
LEANN addresses a specific bottleneck in local RAG: storage. Traditional RAG pipelines store full document chunks, plus an embedding for each chunk, in a vector database. For a knowledge base of ten million chunks averaging 2KB each, that is 20GB of raw text plus embedding indices (10,000 chunks of that size would come to only 20MB). On a laptop with 512GB storage, dedicating 20GB or more to a RAG index is feasible but wasteful.
LEANN reports 97% storage savings. According to the project README, it gets there through graph-based selective recomputation: instead of persisting an embedding for every chunk, it keeps a pruned graph index and recomputes embeddings on demand at query time. The tradeoff is extra compute per query and a small accuracy hit on retrieval (typically 1-3% drop in recall@10 depending on the dataset), which is acceptable for most personal RAG use cases where the alternative is no RAG at all because the index does not fit on disk.
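For readers unfamiliar with the metric, recall@10 here means the fraction of the true ten nearest neighbors (from an exact search) that the approximate index also returns in its top ten. A minimal sketch, with a hypothetical function name and made-up document ids:

```python
def recall_at_k(approx_ids: list[int], exact_ids: list[int], k: int = 10) -> float:
    """Fraction of the exact top-k results that also appear in the approximate top-k."""
    if k <= 0:
        raise ValueError("k must be positive")
    exact_top = set(exact_ids[:k])
    if not exact_top:
        raise ValueError("exact_ids must not be empty")
    return len(set(approx_ids[:k]) & exact_top) / len(exact_top)


# Made-up ids: the approximate search misses one of the ten true neighbors.
exact = list(range(10))
approx = [0, 1, 2, 3, 4, 5, 6, 7, 8, 42]
print(f"recall@10 = {recall_at_k(approx, exact):.2f}")
```

Averaged over many queries, a drop of a few percent means the compressed index occasionally swaps one true neighbor for a near miss.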
The system runs entirely on personal devices. No external API calls for embeddings, no cloud vector store. The embedding model, the retrieval index, and the generative model all run locally. For an ML engineer building a private knowledge assistant that cannot send data to OpenAI or Anthropic, this is the architecture.
Comparison & Analysis
The local inference space has two main approaches: hand-assembled toolchains (ollama + Open WebUI + ChromaDB + custom glue code) versus integrated stacks (DreamServer, LM Studio).
Hand-assembled toolchains give you flexibility. You pick the exact inference engine, quantization level, vector store, and embedding model. The cost is integration effort. Wiring ollama's API to a chat frontend, then adding RAG with a local vector store, then connecting voice I/O, takes days of configuration and debugging. Each component updates independently, so things break.
DreamServer trades flexibility for convenience. One install, everything works together. The cost is lock-in to DreamServer's choices for inference backend, vector store, and model formats. If you need a specific quantization scheme or want to swap in a custom embedding model, you are modifying DreamServer's codebase rather than swapping a component.
For the RAG layer specifically, LEANN compares against traditional pipelines built on ChromaDB or Qdrant. A standard ChromaDB setup with 10K documents uses approximately 2-4GB for embeddings plus the raw text storage. LEANN's compressed approach reduces this to roughly 100-200MB. The recall drop from compression is measurable but narrow: LEANN reports 94-97% of the retrieval quality of uncompressed indices on standard benchmarks (MTEB retrieval tasks). For personal RAG where you are querying your own notes and documents, a recall hit of a few percent is an easy trade for 97% storage savings.
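The storage figures are easy to sanity-check. The sketch below assumes uncompressed float32 embeddings, a 768-dimensional embedding model, and about one million chunks (roughly 100 chunks per document for a 10K-document corpus); all three are illustrative assumptions, not numbers from LEANN or ChromaDB.

```python
def flat_embedding_gb(n_chunks: int, dim: int, bytes_per_value: int = 4) -> float:
    """Size of an uncompressed embedding table in GB (1 GB = 1e9 bytes)."""
    return n_chunks * dim * bytes_per_value / 1e9


def savings_fraction(uncompressed_gb: float, compressed_gb: float) -> float:
    """Share of storage saved by the compressed index, between 0 and 1."""
    return 1 - compressed_gb / uncompressed_gb


baseline = flat_embedding_gb(n_chunks=1_000_000, dim=768)  # assumed corpus shape
for compressed_gb in (0.1, 0.2):  # the 100-200MB range quoted in the text
    saved = savings_fraction(baseline, compressed_gb)
    print(f"{baseline:.2f} GB -> {compressed_gb:.1f} GB: {saved:.1%} saved")
```

Under these assumptions the baseline falls inside the 2-4GB range quoted above, and the savings land in the mid-90s percent, in line with the headline figure.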
The hardware benchmarking approach in whichllm also deserves comparison against the status quo. Most local AI guides recommend models by parameter count and RAM tiers: "8GB RAM = 7B Q4, 16GB = 13B Q4, 32GB = 70B Q4." This is crude. An M2 MacBook Air with 16GB unified memory runs 7B Q4 models faster than a desktop i7 with 32GB DDR4 because the M2's memory bandwidth (about 100 GB/s) is roughly double the desktop's dual-channel DDR4-3200 (about 50 GB/s). whichllm accounts for this by benchmarking actual token generation speed on your hardware rather than relying on RAM-only heuristics.
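The bandwidth figures in this comparison follow from the memory spec itself: transfers per second times bus width. A small sketch, assuming 64-bit DDR channels and treating the M2's LPDDR5-6400 as a single 128-bit interface; these are theoretical peaks, not measured throughput.

```python
def peak_bandwidth_gb_s(mega_transfers_per_s: int, bus_width_bits: int, channels: int) -> float:
    """Theoretical peak memory bandwidth in GB/s (1 GB = 1e9 bytes)."""
    bytes_per_transfer = bus_width_bits / 8
    return mega_transfers_per_s * 1e6 * bytes_per_transfer * channels / 1e9


configs = {
    "dual-channel DDR3-1600": (1600, 64, 2),
    "dual-channel DDR4-3200": (3200, 64, 2),
    "M2 unified memory (LPDDR5-6400, 128-bit)": (6400, 128, 1),
}

for name, (rate, width, channels) in configs.items():
    print(f"{name}: {peak_bandwidth_gb_s(rate, width, channels):.1f} GB/s")
```

Because decode speed scales with bandwidth rather than capacity, the 16GB machine with the wider memory interface comes out ahead of the 32GB one for any model that fits in both.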
Practical Implications
For ML engineers building edge deployments or local-first products, the current state of local inference tooling has three concrete implications.
First, hardware requirements are lower than your product specs probably assume. If your target user has a reasonably specced laptop made after 2018, you can likely run a 2B-4B quantized model at interactive speeds. The 2015 desktop benchmark suggests this extends even further back for non-interactive workloads. Do not over-provision your hardware recommendations.
Second, model selection should be automated, not manual. Tools like whichllm show that the optimal model for a given machine depends on memory bandwidth, core count, and GPU architecture in ways that are not obvious. Building hardware-aware model selection into your deployment pipeline (rather than shipping one model and hoping it works) will reduce support burden and improve user experience.
Third, private RAG is now deployable on consumer hardware. LEANN's compression means you can ship a RAG application that runs entirely on-device with a reasonable storage footprint. For healthcare, legal, and financial applications where data cannot leave the device, this removes a major objection to local AI deployment.
The deployment path looks like this: profile the target hardware, select the largest model that runs at your minimum acceptable token speed, quantize to fit, and use compressed RAG indices for retrieval. The tooling exists to automate every step. What remains is integrating these pieces into coherent products rather than developer toolkits.
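The selection step in that path, "largest model that runs at your minimum acceptable token speed", reduces to a filter and a max. This is a sketch with hypothetical names; the speeds are midpoints of the ranges reported earlier for the 2015 desktop, and in practice they would come from benchmarking on the target machine.

```python
from typing import NamedTuple, Optional


class Candidate(NamedTuple):
    name: str
    params_billion: float
    measured_tokens_per_s: float  # from a benchmark run on the target hardware


def pick_model(candidates: list[Candidate], min_tokens_per_s: float) -> Optional[Candidate]:
    """Return the largest candidate that meets the speed floor, or None if none does."""
    usable = [c for c in candidates if c.measured_tokens_per_s >= min_tokens_per_s]
    return max(usable, key=lambda c: c.params_billion, default=None)


candidates = [
    Candidate("2B Q4_K_M", 2.0, 10.0),  # midpoint of the 8-12 tokens/s range
    Candidate("4B Q4_K_M", 4.0, 6.0),   # midpoint of the 5-8 tokens/s range
]

for floor in (5.0, 8.0, 20.0):
    choice = pick_model(candidates, floor)
    print(f"floor {floor:.0f} tokens/s -> {choice.name if choice else 'no model fits'}")
```

Returning `None` rather than the fastest model keeps the failure explicit, so the caller can decide whether to relax the speed floor or fall back to a remote model.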
The offline Arch Linux setup on the ROG Flow Z13 is a proof point: a fully local development environment with AI coding assistance, zero cloud dependencies, running on consumer hardware. It works because every layer of the stack (inference, UI, RAG, voice) now has a local-first option. The assembly is still manual, but the components exist.
References
- "My fully offline AI-assisted Linux development machine" — https://dev.to/deepu105/my-fully-offline-ai-assisted-linux-development-machine-3lnl
- "Old PC vs New AI: Can a 2015 Desktop Actually Run Gemma 4? (2B vs 4B Benchmark)" — https://dev.to/gramli/old-pc-vs-new-ai-can-a-2015-desktop-actually-run-gemma-4-2b-vs-4b-benchmark-2eg6
- "Local AI needs to be the norm" — https://unix.foo/posts/local-ai-needs-to-be-norm/
- "Can I run AI locally?" — https://www.canirun.ai/
- "Light-Heart-Labs/DreamServer" — https://github.com/Light-Heart-Labs/DreamServer
- "Andyyyy64/whichllm" — https://github.com/Andyyyy64/whichllm
- "yichuan-w/LEANN" — https://github.com/yichuan-w/LEANN