Skip to content

Gemma 4: What Google Actually Shipped, And What Everyone Is Talking About

#gemma #foundation-models #multimodal-ai #local-llm #llama.cpp #mixture-of-experts

Google shipped Gemma 4 this week. It is not another incremental open model drop. This is the first major open foundation model line built from the start for both server and on-device deployment, and it makes very specific tradeoffs that almost no one is reporting correctly.

The full model line

This release shipped five complete models, all under Apache 2.0 license, with both pre-trained and instruction tuned variants.

SizeArchitectureContext WindowSupported ModalitiesTarget Hardware
E2BDense128kText, Image, Audio, VideoHigh end phones
E4BDense128kText, Image, Audio, VideoEntry level laptops
12BDense256kText, Image, Audio, Video16GB consumer laptops
26BMoE A4B256kText, ImageServer / 24GB GPUs
31BDense256kText, ImageServer / 40GB GPUs

All models support 140+ languages, native function calling, and follow the exact same prompt template. There are no hidden fine print restrictions on commercial use.

The encoder-free trick everyone is missing

This is the single most important architectural change in this release, and almost every coverage piece skipped it entirely.

Every production multimodal model released before Gemma 4 used separate dedicated encoders. For images you ran SigLIP or CLIP. For audio you ran Whisper or a similar speech encoder. Those encoders output embeddings which got projected and injected into the LLM context. Those encoders add 2-4GB of permanent memory overhead, add 50-100ms latency per input, and sit completely idle any time you are just generating text.

Gemma 4 12B removes them entirely.

Vision input uses one single linear matrix multiplication, positional embedding, and normalization. That is all. No convolution layers. No transformer blocks. No separate vision model. Audio processing goes one step further. Raw mel spectrogram frames are projected directly into the LLM embedding space with no intermediate processing at all.

All visual and audio reasoning is performed by the exact same transformer backbone that does text. This is why the full multimodal 12B model fits into 7.8GB of VRAM when 4bit quantized. This is why it generates text 2x faster than Llama 3.2 11B Vision on identical hardware.

Benchmark positioning

Official and third party benchmarks place the 12B instruction tuned model within 2% of Llama 3.1 28B on general reasoning tasks. It beats Llama 3.2 12B across every single published benchmark.

BenchmarkGemma 4 12BLlama 3.2 12BLlama 3.1 28B
MMLU78.174.979.8
GSM8K72.368.174.0
MATH41.235.743.1
HumanEval31.229.737.3

It is still well behind DeepSeek V2 16B on coding. No argument there. But for general purpose agent work, multimodal tasks, and long context work this is currently the best model you can run on a 16GB laptop.

Native system prompts are not a minor feature

Almost no one commented on this line in the model card. Gemma 4 has hardcoded native system prompt support.

You do not inject system messages into the first user turn. You do not wrap them in special tokens. There is an explicit separate embedding space for system messages, handled at the lowest level of the model forward pass.

This is not a quality of life change. This fixes the single most consistent failure mode of all existing instruction tuned models: system prompt leakage, ignored instructions after 10 turns, and trivial prompt injection. Early independent testing shows system prompts hold perfectly even at 220k tokens into the context window. No other open model does this right now.

MoE implementation details

The 26B A4B model is an activation MoE with 4 experts per layer, 1 active per token. That means actual compute per generated token runs against 6.5B parameters. Not 26.

Google did not hide this. They also did not advertise it. Most people reading the model card still think this is a 26B dense model. It is not. It is effectively a 6.5B inference model with 26B of stored knowledge.

This is an extremely good tradeoff for server deployments. You get near 30B model knowledge at faster inference speed than the 12B dense model. It will become the default server endpoint model for most teams that do not need full 70B class performance.

What the community already found

48 hours after the official release, llama.cpp merged PR #24077. The PR has no description. No announcement. Just code.

That code implements support for an unannounced Gemma 4 Unified variant. Comments in the code confirm this model has no separate vision projection layer at all. Text tokens, raw image pixels, and raw audio samples all share the exact same embedding table. There is no distinction between modalities at the lowest level of the model.

Google has not mentioned this variant anywhere in official material. It will almost certainly be released publicly within 30 days. This is not an incremental improvement. This is the first general purpose foundation model built without hardcoded assumptions about what kind of input it will receive.

The 124B elephant in the room

Within 12 hours of release the entire LocalLLaMA community had coordinated on one single demand. Everyone wants the 124B variant.

All benchmarks and internal leaks confirm this model exists. It was run during internal testing. Google deliberately held it back. The thread requesting release is currently the highest voted post in the seven year history of the subreddit.

This is not entitlement. The open model ecosystem has hit a very clear inflection point. Every vendor now releases small and medium models, and holds back the large flagship variant. For the first time the community is explicitly, collectively pushing back on this pattern. It will be very interesting to see if Google responds.

Deployment status right now

This was the best supported model launch in history. At time of writing, 72 hours after release, you can run Gemma 4 12B on:

  • Ollama with one line install
  • LM Studio one click download
  • llama.cpp merged day zero
  • MLX for Apple Silicon
  • vLLM and SGLang for server deployment
  • Unsloth already has full 4bit fine tuning support available

GGUF quantizations were uploaded 90 minutes after the official Hugging Face release. No other model has ever had full ecosystem support on launch day.

Tradeoffs and limitations

This is not a perfect model. You should know the flaws before you build anything on it.

Audio input only works reliably up to 10 minutes right now. There is no native video output. Coding performance is acceptable but not class leading. The MoE model has very bad long tail performance on rare knowledge tasks. All models have standard Google alignment bias, for better and worse.

Most importantly: the 12B model will hallucinate visual details. A lot. Because it has no dedicated vision encoder it will confidently invent parts of images that it does not have high confidence on. This is the core permanent tradeoff for the tiny memory footprint.

What this actually changes

Gemma 4 has reset the baseline for open models.

Every open model released from this point forward will be judged against a simple standard: can it run on a 16GB laptop, support image and audio, have 256k context, and not suck at reasoning. Before this week that bar did not exist.

Google did not do this out of charity. They are racing to establish Gemma as the default open foundation model before Llama 4 lands. Whatever the motivation, this is the best general purpose open model release we have had in 18 months.

If you are evaluating models for local deployment, agents, or edge use cases you should stop what you are doing and test this one this week.