This Month In Local LLMs: Cohere Goes Open, MTP Quantization Lands, And Everyone Waits For Qwen 3.7

#local-llm #open-source-llm #quantization #mtp #qwen #cohere #llama.cpp

Cohere finally delivered on their open model promise

Three months ago Cohere got dragged openly on /r/LocalLLaMA for promising open models and delivering nothing. This week they showed back up, apologized, and dropped Command A+.

This is not a research throwaway. It is Cohere's first MoE architecture, built explicitly for deployment. It runs a usable 8k context on a single 16GB GPU and fits 32k on 24GB. Most importantly, it is released under the Apache 2.0 license. No usage restrictions, no fine print, no royalties.

Cohere did something almost no other vendor does: they optimized for small teams first. They did not just dump raw bf16 weights. They shipped pre-tested quantizations, verified agent tool calling performance, and tested the model end to end on consumer hardware before release.

This is a shot directly across Meta's bow. For anyone building production self-hosted systems, license terms matter more than two points on MMLU. Until this week there was no competitive 20-30B class model available without commercial license restrictions. That changed on Tuesday.

MTP quantization is not a meme anymore

For the last two years every local LLM engineer operated on one unchallenged rule: run the smallest quant whose quality you can tolerate. That rule is dead.

ByteShape published full cross-hardware benchmarks for Qwen 3.6 35B this week, testing both standard next-token prediction (NTP) quants and the new multi-token prediction (MTP) variants across 7 consumer GPUs and 4 CPU architectures. The results are unambiguous.

On GPU, MTP delivers consistent 20-40% generation speedup. There is no measurable quality degradation on real workloads for quants down to 3.5 bpw.

There are hard tradeoffs. MTP uses 10-15% more VRAM. Speedup varies wildly by prompt length and generation pattern. It provides zero benefit on CPU, and actually runs slower on ARM devices including Raspberry Pi.
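To make the tradeoff concrete, here is a rough back-of-the-envelope sketch in Python. It only applies the 20-40% speedup and 10-15% VRAM overhead ranges quoted above to numbers you measure yourself. The function name and its inputs are made up for illustration, and since real speedup varies with prompt length and generation pattern, treat the output as a range to sanity check, not a prediction.

```python
def mtp_estimate(ntp_vram_gb, ntp_tokens_per_s, vram_budget_gb):
    """Estimate GPU-side MTP VRAM use and throughput from NTP measurements.

    The multipliers come from the ranges quoted in the text:
    10-15% more VRAM, 20-40% faster generation. All inputs are
    hypothetical values you would measure on your own hardware.
    """
    vram_low, vram_high = ntp_vram_gb * 1.10, ntp_vram_gb * 1.15
    tps_low, tps_high = ntp_tokens_per_s * 1.20, ntp_tokens_per_s * 1.40
    return {
        "vram_gb": (vram_low, vram_high),
        "tokens_per_s": (tps_low, tps_high),
        # Conservative check: only report a fit if the worst case fits.
        "fits": vram_high <= vram_budget_gb,
    }


# Example: an NTP quant that uses 20 GB and generates 50 tokens/s on a 24 GB card.
estimate = mtp_estimate(ntp_vram_gb=20.0, ntp_tokens_per_s=50.0, vram_budget_gb=24.0)
print(estimate)
```

The check uses the high end of the VRAM range on purpose. If the MTP variant only fits in the best case, you will find out the hard way halfway through a long generation.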

The biggest surprise was not MTP itself. For standard NTP quants, the largest variant that fit in memory almost always won on both quality and speed. Stop minimizing bits per weight. If it fits, run the larger quant. In these benchmarks it was usually faster too.
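The "if it fits, run the larger quant" rule is easy to turn into a selection helper. This is a minimal sketch, not anything ByteShape published: the file sizes below are placeholder numbers rather than real Qwen 3.6 35B sizes, and the 2 GB reserve for KV cache and runtime buffers is an assumption you should replace with your own measurement.

```python
def pick_quant(quant_sizes_gb, vram_gb, reserve_gb=2.0):
    """Pick the largest quant whose weights fit in VRAM minus a reserve.

    quant_sizes_gb maps a quant name to its file size in GB.
    reserve_gb is an assumed allowance for KV cache and runtime buffers.
    Returns None when nothing fits.
    """
    budget = vram_gb - reserve_gb
    fitting = {name: size for name, size in quant_sizes_gb.items() if size <= budget}
    if not fitting:
        return None
    # Largest file that still fits, per the rule in the text.
    return max(fitting, key=fitting.get)


# Placeholder sizes for illustration only.
sizes = {"Q3_K_M": 16.0, "Q4_K_M": 20.0, "Q5_K_M": 24.5, "Q6_K": 28.0}
print(pick_quant(sizes, vram_gb=24.0))
```

Context length moves the reserve a lot, so a quant that fits at 8k may not fit at 32k. Rerun the selection with a larger reserve_gb before committing to a long-context deployment.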

The team also explicitly excluded MMLU from this comparison. Qwen 3.6 has known answer format bias that makes MMLU scores completely useless for quantization testing. Almost no one publishing benchmarks admits this.

Qwen 3.7 is the elephant in every room

Right now half the /r/LocalLLaMA front page is people waiting for Qwen 3.7. This is not normal hype.

Early benchmark results for the full 122B Qwen 3.7 Max place it 5th globally, effectively tied with GPT 5.4 xhigh and one notch above Gemini 3.5 Flash.

Nobody cares about the 122B variant. Everyone is waiting for the 27B and 35B cuts. Qwen 3.6 27B scored exactly 6 points below the 122B Max variant. If Qwen 3.7 holds that same gap, we will have a model that runs on a single 24GB GPU and lands within about 6 points of the current closed frontier.

That has never happened before. That would change every assumption about building with LLMs.

There is no confirmed release date. Roadmap comments from Qwen engineering confirm the models are complete. They are just finalizing quantization and distribution. People are refreshing Hugging Face every 10 minutes. It will probably drop before this article is 48 hours old.

Tooling finally stopped building for researchers

Two tiny unremarkable changes landed this week that will save engineers thousands of hours.

Hugging Face added a model size filter to their benchmark leaderboards. You can now pull up SWE-bench Verified results and show only models under 32B parameters. This was the single most requested feature for two and a half years.

For context: 98% of all deployed open LLMs are under 32B. Until this week every public leaderboard was sorted first by 400B+ research models that no one will ever run in production.
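If you keep your own benchmark spreadsheet, the same filter is a few lines of Python. This sketch does not call any Hugging Face API. The Entry type, the model names, and the scores are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Entry:
    model: str        # placeholder model name
    params_b: float   # parameter count in billions
    score: float      # benchmark score, higher is better


def under_size(entries, max_params_b=32.0):
    """Keep entries strictly under the size cutoff, best score first."""
    kept = [e for e in entries if e.params_b < max_params_b]
    return sorted(kept, key=lambda e: e.score, reverse=True)


# Made-up rows, not real leaderboard data.
board = [
    Entry("model-a", 400.0, 70.0),
    Entry("model-b", 27.0, 55.0),
    Entry("model-c", 8.0, 40.0),
    Entry("model-d", 32.0, 60.0),
]
for entry in under_size(board):
    print(entry.model, entry.params_b, entry.score)
```

The cutoff is strict, so a model at exactly 32B is excluded. Change the comparison if you read "under 32B" more loosely.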

Unsloth also hit GitHub trending this week with their new Studio UI. It is a single interface to download, run, fine-tune, and export models. No separate runtime, no config files, no command line arguments. It works for every major model family out of the box.

We are finally past the phase where open LLM tooling was built exclusively for people publishing papers. Now it is being built for people shipping products.

The quiet shift no one is talking about

None of this is the most important change happening right now.

Six months ago almost all meaningful progress on local LLMs came from independent contributors working in their spare time. This month:

  • Cohere engineers are answering support questions on Reddit
  • Nvidia engineers submitted the MTP backend sampling PR to llama.cpp
  • ByteShape quantization teams are publishing independent cross-hardware benchmarks
  • Every major vendor is lurking in the same Discord channels and reading the same comment threads

Companies are no longer just dumping weights over the wall. They are contributing to the shared runtime. They are adopting community standards. They are changing their roadmap based on anonymous comments from random engineers.

This is not open source as a marketing gesture. This is now the default way good models are shipped. Any vendor that does not participate will be left behind.

What you should run this week

If you need something stable for production today, run Qwen 3.6 35B A3B NTP. It is still the best general purpose model available at any size right now.

If you have 30GB or more VRAM, test the MTP variant. For long generation workloads the speed difference is noticeable enough to be worth the memory tradeoff.

If you need a permissive commercial license, switch to Command A+. It is slightly behind Qwen on raw benchmarks but good enough for almost all agent use cases, and you will never have to talk to a Meta lawyer.
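The three recommendations above boil down to a small decision rule. Here it is as a Python sketch. The function and its parameters are hypothetical, the 30GB threshold is the one from the text, and the ordering (license first, then memory and workload) is my reading of the advice rather than a benchmarked policy.

```python
def recommend(vram_gb, needs_permissive_license=False, long_generation=False):
    """Encode this week's picks as a simple decision rule."""
    if needs_permissive_license:
        # License requirements trump benchmark differences.
        return "Command A+"
    if vram_gb >= 30 and long_generation:
        # Worth testing: faster generation at the cost of extra VRAM.
        return "Qwen 3.6 35B A3B MTP"
    # Stable default for production today.
    return "Qwen 3.6 35B A3B NTP"


print(recommend(vram_gb=24))
print(recommend(vram_gb=32, long_generation=True))
print(recommend(vram_gb=48, needs_permissive_license=True))
```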

Wait one week before you start any new fine-tuning job. Do not waste 100 GPU hours training a model that will be obsolete the second Qwen 3.7 drops.

This is the best time there has ever been to build with LLMs. Nothing is locked down. Hard problems are getting solved every week. Nobody knows what will be standard next month. That is the good part.