Qwen3.6, DeepSeek-V4, and the New Math of Local LLM Deployment

machine-learning
#moe #gguf #open-source-llms #multi-token-prediction #local-deployment #quantization

Summary

The latest wave of open-source releases pairs mixture-of-experts architectures with multi-token prediction heads, then ships GGUF-quantized weights for llama.cpp. The result: models like Qwen3.6-35B-A3B carry 35 billion total parameters but activate only 3 billion per token, so a quantized copy fits on a single 24GB consumer GPU while costing about as much compute per token as a 3B dense model. For anyone deploying LLMs on consumer GPUs, this changes the cost-quality tradeoff substantially.

Background & Context

Mark Zuckerberg's open letter "Open Source AI Is the Path Forward" argued that freely available model weights accelerate iteration and improve safety through scrutiny. Whatever you think of Meta's motives (they also benefit from commoditizing the layer where their competitors charge margins), the practical effect is real. The rate of open-weight model releases has made it genuinely hard to keep up.

The constraint for practitioners has always been hardware. A 70B dense model needs roughly 140GB of VRAM in FP16, or about 40GB in 4-bit quantization. That means multiple A100s or a very expensive Mac Studio. Most engineers working on local deployment or on-prem inference don't have that budget.
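The figures above follow from simple arithmetic: parameter count times bits per weight. A minimal sketch, assuming an effective 4.5 bits per weight for "4-bit" quantization (an illustrative figure to cover scales and other metadata, not a property of any specific format) and ignoring KV cache and activations:

```python
def weight_footprint_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in decimal gigabytes."""
    return n_params * bits_per_weight / 8 / 1e9

fp16_gb = weight_footprint_gb(70e9, 16)       # FP16: 2 bytes per parameter
four_bit_gb = weight_footprint_gb(70e9, 4.5)  # assumed effective bits per weight

print(f"70B dense @ FP16     : {fp16_gb:.0f} GB")
print(f"70B dense @ ~4.5 bpw : {four_bit_gb:.1f} GB")
```

The same function works for any model size; the result is a floor on VRAM, since the runtime also needs room for the KV cache and activations.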

Two architectural shifts are cracking this problem open. Sparse mixture-of-experts (MoE) models decouple total parameter count from per-token compute by routing each token through a subset of experts. Multi-token prediction (MTP) adds auxiliary prediction heads that train the model to forecast several future tokens at once, which then serve as cheap draft tokens for speculative decoding at inference time. Combine both with 2-4 bit GGUF quantization, and you get models that previously required datacenter hardware running on a single consumer GPU.

Technical Deep Dive

Mixture-of-Experts routing in Qwen3.6

The Qwen3.6-35B-A3B variant is the more architecturally interesting of the two Qwen releases. Total parameters: 35 billion. Active parameters per forward pass: 3 billion. The model uses a top-K expert routing scheme where each token is dispatched to K experts (typically K=2) from a larger pool.

The routing mechanism works as follows. Each MoE layer contains N expert networks (feed-forward modules) and a lightweight gating network G(x). For an input token representation x, the gating network produces a probability distribution over experts:

G(x) = softmax(W_g · x)

The top-K experts are selected, and the output is a weighted combination:

output = Σ_{i∈topK} G(x)_i · E_i(x)

where E_i is the i-th expert's feed-forward computation. The key engineering detail is that only the selected experts' parameters are read and computed for each token; all experts still have to be resident in memory, because the next token may route elsewhere. If you have 64 experts and route to 2, you compute roughly 2/64 of the feed-forward parameters per token.
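The two formulas above can be sketched in a few lines of plain Python. This is an illustration of the routing idea only, not Qwen's implementation: the function names, the toy experts, and the gate weights are all made up for the example. It follows the formula as written and weights each expert by its raw softmax probability; many implementations renormalize the weights over the selected K instead.

```python
import math

def softmax(logits):
    m = max(logits)                      # subtract the max for numerical stability
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, gate_weights, experts, k=2):
    """x: list of floats; gate_weights: one row per expert; experts: callables."""
    # G(x) = softmax(W_g . x)
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in gate_weights]
    probs = softmax(logits)
    # Indices of the k experts with the highest gate probability
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    out = [0.0] * len(x)
    for i in top:
        y = experts[i](x)                # only the selected experts are evaluated
        out = [o + probs[i] * yi for o, yi in zip(out, y)]
    return out, top

# Toy setup (hypothetical): four "experts" that just scale their input.
experts = [lambda v, s=s: [s * vi for vi in v] for s in (1.0, 2.0, 3.0, 4.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]]

output, chosen = moe_forward([2.0, 1.0], gate_weights, experts, k=2)
print("experts used:", chosen)
print("output:", output)
```

Experts 2 and 3 are never called for this input, which is the whole point: their parameters cost memory but no compute.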

For Qwen3.6-35B-A3B specifically, the 35B total breaks down into shared attention parameters (which always activate) plus the expert FFN parameters (most of which sit idle per token). The 3B active count means the compute cost per token is comparable to a 3B dense model, while the knowledge capacity approaches what 35B parameters can store.

The dense Qwen3.6-27B-MTP variant skips MoE entirely. It is a straightforward transformer with 27 billion parameters, but it shares the MTP heads discussed below.

Multi-token prediction heads

MTP is the second piece. Standard autoregressive models predict one token at a time. MTP adds n-1 auxiliary prediction heads that each forecast a future token given the current hidden state. During training, all heads receive gradients, which improves representation quality. During inference, the auxiliary heads generate candidate tokens for speculative decoding.

The training loss becomes:

L = (1/n) Σ_{k=1}^{n} CE(p_k, t_k)

where p_k is the prediction from the k-th head and t_k is the target token at position k steps ahead. Each head has its own output projection (tied or untied embeddings depending on implementation), but they share the trunk transformer layers.
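As a concrete reading of that loss, here is a small sketch that averages the cross-entropy of n heads for a single position. The distributions and targets are invented for illustration; a real training loop would compute this over batches with a tensor library.

```python
import math

def cross_entropy(probs, target):
    """Negative log-likelihood of the target token under one head's distribution."""
    return -math.log(probs[target])

def mtp_loss(head_probs, targets):
    """
    head_probs[k] is the distribution from head k+1 (zero-indexed list);
    targets[k] is the token k+1 steps ahead. Returns the mean over heads.
    """
    n = len(head_probs)
    return sum(cross_entropy(p, t) for p, t in zip(head_probs, targets)) / n

# Hypothetical example: vocabulary of 4 tokens, n = 3 heads.
head_probs = [
    [0.70, 0.10, 0.10, 0.10],   # main head, fairly confident
    [0.25, 0.25, 0.25, 0.25],   # second head, uniform
    [0.10, 0.10, 0.40, 0.40],   # third head
]
targets = [0, 2, 3]

loss = mtp_loss(head_probs, targets)
print(f"MTP loss: {loss:.4f}")
```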

At inference, the main head produces the canonical next token. The auxiliary heads produce draft tokens for positions t+2, t+3, and so on. A verification step then scores all drafts in a single batched forward pass and compares them with the main head's own predictions at those positions. Drafts that match are accepted, so several tokens are committed for the cost of one pass. At the first mismatch, the main head's token is kept and the remaining drafts are discarded. When the draft acceptance rate is high (which it tends to be for well-trained MTP heads), effective tokens per second can increase by 1.5-2x.
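The acceptance rule for greedy decoding can be sketched as follows. This is a simplified illustration, assuming the verification pass has already produced the main head's prediction at every draft position plus one extra prediction after the last draft; the function name and token IDs are hypothetical.

```python
def accept_greedy(drafts, main_preds):
    """
    drafts:     tokens proposed by the auxiliary heads, in order.
    main_preds: the main head's greedy prediction at each draft position,
                plus one extra prediction following the final draft.
    Returns the tokens committed in this step.
    """
    assert len(main_preds) == len(drafts) + 1
    committed = []
    for draft, verified in zip(drafts, main_preds):
        if draft != verified:
            # First mismatch: keep the main head's token. Later predictions
            # were conditioned on a wrong draft, so they are discarded.
            committed.append(verified)
            return committed
        committed.append(draft)
    # Every draft matched, so the extra prediction is also valid.
    committed.append(main_preds[-1])
    return committed

print(accept_greedy([5, 9, 3], [5, 9, 7, 1]))  # third draft rejected
print(accept_greedy([5, 9], [5, 9, 4]))        # all drafts accepted
```

Each call commits at least one token, so the worst case matches ordinary decoding apart from the cost of running the draft heads.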

DeepSeek-V4-Pro

DeepSeek has been pushing MoE harder than almost anyone. DeepSeek-V3 introduced auxiliary-loss-free load balancing and kept the fine-grained expert segmentation (more experts, smaller each) from earlier DeepSeekMoE work. DeepSeek-V4-Pro continues this direction with what appears to be a larger expert count and more sophisticated routing. The details in the model card are sparse, but the architectural lineage is clear: shared attention layers with MoE feed-forward blocks, potentially using the multi-head latent attention (MLA) mechanism, introduced in DeepSeek-V2 and carried into V3, that compresses the KV cache by projecting keys and values into a lower-dimensional latent space.

MLA computes:

c_t = W_dkv · h_t

where c_t is the compressed latent (much smaller than the full key/value dimensions), and the actual keys and values are reconstructed as:

k_t = W_k · c_t, v_t = W_v · c_t

Because only c_t has to be cached per token, rather than the full keys and values, this compression drastically reduces the KV cache memory footprint, which is often the binding constraint for long-context inference on consumer GPUs.
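To get a feel for the saving, compare the bytes cached per token with and without the latent projection. The dimensions below are hypothetical and chosen only to make the arithmetic visible; they are not taken from any DeepSeek model card. Published MLA descriptions also cache a small positional key component, which this sketch leaves out.

```python
def full_kv_bytes_per_token(n_layers, n_heads, head_dim, bytes_per_value=2):
    """Standard multi-head attention: cache one key and one value per head per layer."""
    return n_layers * 2 * n_heads * head_dim * bytes_per_value

def latent_bytes_per_token(n_layers, latent_dim, bytes_per_value=2):
    """Latent caching: store only the compressed vector c_t per layer."""
    return n_layers * latent_dim * bytes_per_value

# Hypothetical dimensions, FP16 cache (2 bytes per value).
full = full_kv_bytes_per_token(n_layers=48, n_heads=32, head_dim=128)
latent = latent_bytes_per_token(n_layers=48, latent_dim=512)

print(f"full KV cache : {full / 1024:.0f} KiB per token")
print(f"latent cache  : {latent / 1024:.0f} KiB per token")
print(f"reduction     : {full / latent:.0f}x")
```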

GGUF quantization from Unsloth

Unsloth's GGUF releases of both Qwen3.6 variants are where the rubber meets the road for local deployment. GGUF is the file format used by llama.cpp (and derivatives like ollama and LM Studio). It supports a range of quantization schemes from Q8_0 down to IQ2_XXS.

The practical choices for these models:

Q4_K_M (4-bit quantization with mixed precision for important tensors) is the sweet spot for most use cases. Quality loss relative to FP16 is typically 0.5-1% on standard benchmarks. For Qwen3.6-35B-A3B, a Q4_K_M quantization brings the file size to roughly 20GB (35 billion parameters at a little over 4 bits each), which fits in a single RTX 4090's 24GB VRAM but leaves only a few gigabytes for the KV cache.

Q2_K and IQ2 variants push further, cutting file sizes by another 30-40% but with more noticeable quality degradation. These make sense if you are VRAM-constrained (say, an RTX 3090 at 24GB running a longer context window) and willing to accept some loss.

The key point: because MoE models only read a fraction of the experts' weights per token, the memory bandwidth bottleneck is less severe than for dense models of equivalent total size. The quantization penalty is also somewhat amortized, since the inactive experts' quantization errors don't compound into the current token's computation.

Smaller models in the ecosystem

The other releases round out the picture. MiniCPM-V-4.6 from OpenBMB is a compact vision-language model, relevant if you need multimodal capabilities without a 70B-class VLM. Zyphra's ZAYA1-8B targets the 8B dense class that has become the default "small but capable" tier. Sulphur-2-base and circlestone-labs/Anima are newer entrants whose model cards are thin on architectural details, but their presence on HuggingFace signals the continued expansion of the open-weight ecosystem.

Comparison & Analysis

The most instructive comparison is Qwen3.6-35B-A3B versus a dense model in the 7-8B class, such as Llama-3.1-8B or Mistral-7B.

Compute per token is in the same range: 3B active parameters versus 7-8B. So FLOPs per token are actually lower for the MoE model. But quality is substantially higher. On standard benchmarks (MMLU, HumanEval, GSM8K), MoE models with 3B active parameters from a 35B total pool typically score 10-20 percentage points above dense 3B models and competitive with or above dense 7B models. The extra capacity in the inactive experts provides knowledge that the routing network can selectively access.

The tradeoff is memory. You need to load all 35B parameters into VRAM even though only 3B compute per token. In FP16, that is roughly 70GB. In Q4_K_M GGUF, about 20GB. Compare with Llama-3.1-8B in Q4_K_M at roughly 5GB. So you need about 4x the VRAM for the MoE model, but you get meaningfully better output quality at similar compute cost.

Against DeepSeek-V4-Pro, the comparison shifts. DeepSeek's MoE models have historically used more experts with smaller individual sizes (fine-grained segmentation), which improves load balancing and reduces the variance in expert utilization. The tradeoff is slightly more overhead in the routing computation. If DeepSeek-V4-Pro follows V3's pattern of 256 routed experts with 8 active, the routing granularity is finer than Qwen's approach, which can mean better specialization but also more sensitivity to distribution shift between training and deployment data.

For the MTP variants, speculative decoding speedup depends on the acceptance rate of draft tokens. In practice, MTP-trained models see 1.4-1.8x speedup on greedy decoding and 1.2-1.5x on sampling with moderate temperature. This is not as dramatic as the 2-3x sometimes claimed in papers, but it is a free speedup on top of the MoE compute savings.
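The link between acceptance rate and speedup can be made concrete with the standard geometric-series estimate from the speculative decoding literature. A minimal sketch, assuming each draft is accepted independently with the same probability (a simplification; real acceptance rates vary by position and by text):

```python
def expected_tokens_per_step(accept_rate: float, n_drafts: int) -> float:
    """
    Expected tokens committed per verification pass:
    1 + a + a^2 + ... + a^n  =  (1 - a^(n+1)) / (1 - a)
    """
    if accept_rate >= 1.0:
        return float(n_drafts + 1)
    return (1 - accept_rate ** (n_drafts + 1)) / (1 - accept_rate)

for rate in (0.5, 0.7, 0.8):
    tokens = expected_tokens_per_step(rate, n_drafts=2)
    print(f"acceptance {rate:.0%}, 2 drafts: {tokens:.2f} tokens per pass")
```

This counts tokens per pass, not wall-clock speedup. Running the draft heads and verifying extra positions is not free, which is one reason measured speedups land below the figure this formula gives.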

Practical Implications

If you are deploying models locally or on-prem, the decision tree now looks different than it did six months ago.

For a single RTX 4090 (24GB VRAM), Qwen3.6-35B-A3B in Q4_K_M is viable but tight. Weights take roughly 20GB, leaving about 4GB for KV cache and activations. At 4K context length with MLA-style compression or standard grouped-query attention, this works. At 32K context, you will need to drop to Q2_K or IQ3 quantization, or offload some layers to CPU RAM (which hurts tokens/s).
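How quickly the KV cache eats into that headroom depends on the attention layout. The sketch below sizes a grouped-query attention cache in FP16; the layer count, KV head count, and head dimension are hypothetical placeholders, so substitute the values from the model's config before relying on the result.

```python
def gqa_kv_cache_gib(n_layers, n_kv_heads, head_dim, context_len, bytes_per_value=2):
    """KV cache size in GiB: one key and one value per KV head, per layer, per token."""
    total_bytes = 2 * n_layers * n_kv_heads * head_dim * context_len * bytes_per_value
    return total_bytes / 2**30

# Hypothetical architecture: 48 layers, 8 KV heads of dimension 128.
for context_len in (4096, 32768):
    gib = gqa_kv_cache_gib(48, 8, 128, context_len)
    print(f"{context_len:>6} tokens: {gib:.2f} GiB of KV cache")
```

The cache grows linearly with context, so going from 4K to 32K multiplies it by eight. Compare the result against whatever VRAM remains after the weights are loaded.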

For an RTX 3090 (also 24GB, but with slower memory bandwidth), the same model works but expect lower tokens/second. An RTX 4080 has only 16GB, which leaves much less room, so plan on a lower-bit quantization or partial CPU offload there. Memory bandwidth matters more than pure compute for MoE decoding, since expert weights must be fetched from VRAM on each token while little arithmetic is done per byte read. GDDR6X on the 4090 has about 1TB/s bandwidth; GDDR6X on the 3090 has 936GB/s. The difference is small enough not to change the feasibility, but you will notice it in throughput.
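If decoding is bandwidth-bound, the throughput ceiling scales with bandwidth divided by the bytes of weights read per token. The sketch below uses that simplified model to compare the two cards. It assumes 3B active parameters at an effective 4.5 bits per weight and ignores KV cache reads, attention, and kernel overheads, so the absolute ceiling is far above what you will measure; the ratio between cards is the useful part.

```python
def decode_ceiling_tokens_per_s(bandwidth_gb_s, active_params, bits_per_weight):
    """Upper bound on tokens/s if every active weight is read once per token."""
    bytes_per_token = active_params * bits_per_weight / 8
    return bandwidth_gb_s * 1e9 / bytes_per_token

ACTIVE_PARAMS = 3e9      # active parameters per token, from the model name
BITS_PER_WEIGHT = 4.5    # assumed effective bits per weight

rtx_4090 = decode_ceiling_tokens_per_s(1008, ACTIVE_PARAMS, BITS_PER_WEIGHT)
rtx_3090 = decode_ceiling_tokens_per_s(936, ACTIVE_PARAMS, BITS_PER_WEIGHT)

ratio = rtx_4090 / rtx_3090
print(f"4090 vs 3090 bandwidth-bound ceiling: {ratio:.2f}x")
```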

For Mac Studio or MacBook Pro users with unified memory, the calculus is different. You have more total memory (64GB or 128GB configurations) but bandwidth that is at best comparable to a discrete GPU's and often lower (400GB/s on M2 Max, 800GB/s on M2 Ultra). MoE models are bandwidth-bound, so expect 5-15 tokens/second on the Q4_K_M quantization of the 35B-A3B model. The dense 27B variant will be slower per token (more active compute) but has no expert-loading overhead, so actual throughput may be similar.

Deployment tooling is mature. llama.cpp, ollama, and vLLM all support MoE models, and vLLM additionally offers expert parallelism across GPUs. For GGUF files, llama.cpp is the most straightforward path. For production serving with multiple concurrent requests, vLLM's expert parallelism and continuous batching will give better GPU utilization, though you will generally want to start from the original Hugging Face checkpoint (or a quantization format vLLM supports natively) rather than the GGUF file.

One gotcha: MoE models have less predictable latency than dense models. If a batch of tokens all route to the same expert, that expert becomes a bottleneck. In practice, with top-2 routing from 64+ experts, the variance is manageable, but it is worth profiling your specific workload. If you see latency spikes, check whether your serving stack lets you cap per-expert load or pad expert batches to a uniform size; either can smooth out the tail.
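A quick simulation shows how uneven expert load can be even in the best case. The sketch below routes each token to k distinct experts chosen uniformly at random, which is an assumption made for illustration: a learned router is usually more skewed than this, so treat the result as a lower bound on imbalance.

```python
import random
from collections import Counter

def expert_load(batch_tokens, n_experts, k, seed=0):
    """Count how many tokens land on each expert under uniform random top-k routing."""
    rng = random.Random(seed)
    counts = Counter()
    for _ in range(batch_tokens):
        for expert in rng.sample(range(n_experts), k):  # k distinct experts per token
            counts[expert] += 1
    return counts

def imbalance(counts, n_experts):
    """Ratio of the busiest expert's load to the mean load across all experts."""
    mean_load = sum(counts.values()) / n_experts
    return max(counts.values()) / mean_load

counts = expert_load(batch_tokens=256, n_experts=64, k=2)
print(f"busiest expert handles {imbalance(counts, 64):.2f}x the mean load")
```

Since a layer finishes only when its busiest expert does, that ratio is a rough proxy for how much batch latency can stretch. Logging the same statistic from real router outputs is a cheap first step when profiling.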

Fine-tuning MoE models is still awkward. Standard LoRA applied to all experts defeats the sparsity benefit. Expert-specific LoRA (training adapters only on frequently activated experts) works but requires custom training loops. For most practitioners, full fine-tuning of these models is out of scope on consumer hardware; LoRA on the shared attention layers is the practical path, and it works reasonably well for style and format alignment even if it won't teach the model new factual knowledge.

References