Stop scrolling GitHub trending for shiny new LLM frameworks. 90% of them will be abandoned in 12 weeks. This is the set that people are actually deploying this month. None require you to rewrite your entire stack. None ask you to join a Discord for documentation.
The quiet distillation pattern everyone is copying
The most important thing that shipped this month was not a new 1T parameter model. It was a throwaway hackathon project from Hugging Face, and it demonstrates the pattern that will replace most wrapper agents this year.
The Job Searcher does one thing well. You upload a resume. It generates search queries, pulls jobs from LinkedIn, and scores every match across 5 consistent dimensions with written reasoning.
It does not call GPT-4o at inference. It does not use any paid APIs. The authors ran DeepSeek V4 Pro once, offline, to label 10,000 resume/job pairs. They distilled that consistent judgement into Qwen3-8B with two small LoRA adapters. The final model runs on a shared ZeroGPU slice. Cold start is under 3 seconds. Per user cost is effectively zero.
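The offline labelling step can be sketched roughly as follows. This is a minimal illustration, not the authors' pipeline: the `teacher` callable is a hypothetical stand-in for the one-off frontier-model call, and the five dimension names, the 1-to-5 score range, and the JSONL layout are all assumptions, since the post does not list them.

```python
import json
from typing import Callable, Iterable

# Hypothetical dimension names: the post only says there are five.
DIMENSIONS = ("skills", "seniority", "domain", "location", "compensation")


def validate_label(label: dict) -> dict:
    """Reject teacher output that would teach the student an inconsistent format."""
    for dim in DIMENSIONS:
        entry = label.get(dim)
        if not isinstance(entry, dict):
            raise ValueError(f"missing dimension: {dim}")
        score = entry.get("score")
        if not isinstance(score, int) or not 1 <= score <= 5:
            raise ValueError(f"bad score for {dim}: {score!r}")
        if not str(entry.get("reasoning", "")).strip():
            raise ValueError(f"missing reasoning for {dim}")
    return label


def build_training_set(
    pairs: Iterable[tuple[str, str]],
    teacher: Callable[[str, str], dict],
) -> list[str]:
    """Label each (resume, job) pair once and return JSONL lines for fine-tuning."""
    lines = []
    for resume, job in pairs:
        try:
            label = validate_label(teacher(resume, job))
        except ValueError:
            continue  # drop malformed labels instead of training on them
        lines.append(json.dumps({"resume": resume, "job": job, "label": label}))
    return lines


def fake_teacher(resume: str, job: str) -> dict:
    """Stand-in for the frontier model; a real run would call it here, offline."""
    return {d: {"score": 3, "reasoning": f"placeholder reasoning for {d}"} for d in DIMENSIONS}


if __name__ == "__main__":
    rows = build_training_set([("resume text", "job text")], fake_teacher)
    print(len(rows), "training rows")
```

Validating every label before it reaches the training set is one plausible way to get the consistency the student is supposed to inherit; malformed teacher output is dropped rather than repaired.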
Two critical implementation details that every engineer should note:
- They used two separate LoRA adapters on the same base model, one for query generation and one for fit scoring. Attempts to combine both tasks into a single adapter caused consistent formatting leakage across tasks. This is a general solution to a very common fine-tuning bug.
- Improving the teacher prompt helped output quality more than doubling the student model's size did. When they rewrote the labelling prompt to reference specific line items from the resume instead of generic match statements, the small student model adopted exactly the same behaviour.
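To make the two-adapter idea concrete, here is a toy sketch in plain Python of one frozen weight matrix shared by two low-rank adapters, with exactly one adapter active per call. The class names and the tiny matrices are illustrative assumptions; a real setup would use a fine-tuning library rather than hand-written matrix code.

```python
from dataclasses import dataclass

Matrix = list[list[float]]


def matvec(m: Matrix, v: list[float]) -> list[float]:
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


@dataclass
class LoRAAdapter:
    """Low-rank update delta_W = scale * B @ A, with A (r x d_in) and B (d_out x r)."""
    a: Matrix
    b: Matrix
    scale: float = 1.0

    def delta(self, x: list[float]) -> list[float]:
        return [self.scale * v for v in matvec(self.b, matvec(self.a, x))]


class AdaptedLayer:
    """One frozen base weight shared by several task-specific adapters."""

    def __init__(self, base: Matrix) -> None:
        self.base = base
        self.adapters: dict[str, LoRAAdapter] = {}

    def add_adapter(self, task: str, adapter: LoRAAdapter) -> None:
        self.adapters[task] = adapter

    def forward(self, x: list[float], task: str) -> list[float]:
        # Only the adapter for the requested task is applied, so the two
        # tasks never share trainable weights (the isolation the post describes).
        adapter = self.adapters[task]
        return [w + d for w, d in zip(matvec(self.base, x), adapter.delta(x))]


if __name__ == "__main__":
    layer = AdaptedLayer(base=[[1.0, 0.0], [0.0, 1.0]])
    layer.add_adapter("query_generation", LoRAAdapter(a=[[1.0, 0.0]], b=[[0.5], [0.0]]))
    layer.add_adapter("fit_scoring", LoRAAdapter(a=[[0.0, 1.0]], b=[[0.0], [0.5]]))
    print(layer.forward([2.0, 4.0], "query_generation"))
    print(layer.forward([2.0, 4.0], "fit_scoring"))
```

The base matrix is loaded once and never modified; switching tasks only changes which small adapter is added on top, which is why serving both tasks costs little more than serving one.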
You will see this pattern repeated hundreds of times this year. You do not need a frontier model at inference. You need it once, offline, to teach a small model how to make the exact judgement you care about.
Whisper just got the update everyone waited for
Whisper is still the most widely deployed open source ML model in the world. This month it got the update that makes it production ready for almost all use cases.
The new turbo model is 809M parameters, runs in 6GB VRAM, and delivers 8x the inference speed of large-v3 with less than 1% relative degradation in WER. For transcription, this is the new default, and there is little reason left to run large-v3.
| Model | Parameters | Required VRAM | Relative speed |
|---|---|---|---|
| tiny | 39 M | ~1 GB | 10x |
| base | 74 M | ~1 GB | 7x |
| medium | 769 M | ~5 GB | 2x |
| turbo | 809 M | ~6 GB | 8x |
| large-v3 | 1550 M | ~10 GB | 1x |
There is one critical fine print note: turbo is not trained for translation. If you need to translate non-English speech into English, use a multilingual model such as medium. Everything else uses turbo.
Three years after release, tiny.en (39M parameters) still beats every other sub-100M-parameter ASR model ever built. No other ML project has ever had that kind of staying power.
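The selection rule above can be written down as a small helper. This is a sketch of the post's advice, not part of Whisper itself: `pick_model` and the preference orders are hypothetical. The figures for tiny, base, turbo, and large-v3 match the table above; the medium figures (769 M, ~5 GB, 2x) are taken from the Whisper README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class WhisperModel:
    name: str
    params_m: int        # parameters, in millions
    vram_gb: int         # approximate VRAM requirement
    relative_speed: int  # speed relative to large-v3


MODELS = {
    m.name: m
    for m in (
        WhisperModel("tiny", 39, 1, 10),
        WhisperModel("base", 74, 1, 7),
        WhisperModel("medium", 769, 5, 2),
        WhisperModel("turbo", 809, 6, 8),
        WhisperModel("large-v3", 1550, 10, 1),  # listed for completeness only
    )
}

# Preference orders encode the advice in the text: turbo by default,
# but never turbo for translation because it was not trained for it.
PREFERENCE = {
    "transcribe": ("turbo", "medium", "base", "tiny"),
    "translate": ("medium", "base", "tiny"),
}


def pick_model(vram_gb: float, task: str = "transcribe") -> str:
    """Return the first preferred model for the task that fits in VRAM."""
    if task not in PREFERENCE:
        raise ValueError(f"unknown task: {task}")
    for name in PREFERENCE[task]:
        if MODELS[name].vram_gb <= vram_gb:
            return name
    raise ValueError(f"no listed model fits in {vram_gb} GB")


if __name__ == "__main__":
    print(pick_model(8))
    print(pick_model(8, "translate"))
```

On an 8 GB card this picks turbo for transcription and falls back to medium for translation.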
VibeVoice changed the long-form ASR game
This is the first real challenger to Whisper, and it wins cleanly on the use case that matters most for most teams: long-form meeting audio.
VibeVoice ASR accepts 60 minutes of continuous audio in a single pass. No chunking. No sliding window. No diarization drift across hour-long conversations. It outputs structured transcripts with speaker, timestamp, and text in one inference run. It supports user-provided hotwords to drastically improve accuracy on domain-specific terms. It runs on vLLM. It was merged into mainline Transformers this month.
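As an illustration of why a structured transcript is convenient downstream, here is a small sketch that consumes speaker, timestamp, and text records. The JSON field names (`speaker`, `start`, `end`, `text`) are assumptions for the example; the model's actual output schema may differ, so check its documentation before relying on them.

```python
import json
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Segment:
    speaker: str
    start: float  # seconds from the start of the recording
    end: float
    text: str


def parse_segments(raw: str) -> list[Segment]:
    """Parse a JSON array of segments; the field names are hypothetical."""
    return [
        Segment(s["speaker"], float(s["start"]), float(s["end"]), s["text"])
        for s in json.loads(raw)
    ]


def talk_time(segments: list[Segment]) -> dict[str, float]:
    """Total speaking time per speaker, in seconds."""
    totals: dict[str, float] = defaultdict(float)
    for seg in segments:
        totals[seg.speaker] += seg.end - seg.start
    return dict(totals)


def render(segments: list[Segment]) -> str:
    """Format segments as '[MM:SS] speaker: text' lines."""
    lines = []
    for seg in segments:
        minutes, seconds = divmod(int(seg.start), 60)
        lines.append(f"[{minutes:02d}:{seconds:02d}] {seg.speaker}: {seg.text}")
    return "\n".join(lines)


if __name__ == "__main__":
    raw = json.dumps([
        {"speaker": "A", "start": 0, "end": 12.5, "text": "Let's start."},
        {"speaker": "B", "start": 12.5, "end": 20, "text": "Sure."},
        {"speaker": "A", "start": 65, "end": 70, "text": "Next item."},
    ])
    segments = parse_segments(raw)
    print(render(segments))
    print(talk_time(segments))
```

Because speaker labels and timestamps arrive together in one pass, per-speaker statistics and readable minutes need no separate diarization or alignment step.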
Every meeting transcription service will be running this model by the end of the quarter.
Note that Microsoft pulled the original TTS weights after people started generating convincing impersonations within 72 hours of release. The ASR model remains fully available, unencumbered, and MIT licensed.
Khoj is the personal AI that doesn't suck
Most personal AI tools are either closed, require 32GB of VRAM, or break every two weeks. Khoj works.
It runs fully locally. It integrates natively with Obsidian, Emacs, desktop, browser, and even WhatsApp. It works with every LLM you already use: Llama, Qwen, GPT, Claude, Gemini, Mistral. It does not phone home. It does proper incremental semantic indexing. It will not forget documents you added last week.
Most importantly: it does not try to be your friend. It just finds your stuff and answers questions about it.
If you have been looking for something to replace ChatGPT for personal use, this is it. You can set it up in 10 minutes.
Spec Kit killed vibe coding for AI agents
GitHub dropped this tool two weeks ago and almost everyone slept on it. It is the single most useful addition to AI-assisted development released this year.
Spec-Driven Development does not replace your coding agent. It adds guardrails that stop it from producing garbage. It formalizes a sequence that every good engineer was already doing manually:
- Write project constitution: quality standards, testing rules, hard constraints
- Write functional specification: what the thing does, not how
- Write technical plan: tech stack, architecture, boundaries
- Generate discrete implementation tasks
- Execute tasks against the plan
It ships as a set of standard commands that work with 30+ coding agents including Claude Code, Copilot Workspace, Cursor, Gemini and Codex.
It adds 5 minutes to the start of every project. It removes 4 hours of cleanup afterwards. Every engineer using AI to write code should install this today.
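The five-step sequence above is essentially an ordered gate: no stage runs until the previous artifact exists. The sketch below models that idea only. `Stage` and `SpecWorkflow` are hypothetical names and do not reflect Spec Kit's own commands or file layout.

```python
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    CONSTITUTION = 1   # quality standards, testing rules, hard constraints
    SPECIFICATION = 2  # what the thing does, not how
    PLAN = 3           # tech stack, architecture, boundaries
    TASKS = 4          # discrete implementation tasks
    EXECUTE = 5        # run tasks against the plan


class SpecWorkflow:
    """Tracks which artifacts exist and refuses to skip ahead."""

    def __init__(self) -> None:
        self.artifacts: dict[Stage, str] = {}

    def next_stage(self) -> Optional[Stage]:
        """Return the first stage without an artifact, or None when done."""
        for stage in Stage:
            if stage not in self.artifacts:
                return stage
        return None

    def complete(self, stage: Stage, artifact: str) -> None:
        """Record an artifact; earlier stages may be revised, later ones not skipped to."""
        expected = self.next_stage()
        if stage not in self.artifacts and stage != expected:
            raise RuntimeError(f"cannot run {stage.name} before {expected.name}")
        if not artifact.strip():
            raise ValueError(f"{stage.name} needs a non-empty artifact")
        self.artifacts[stage] = artifact


if __name__ == "__main__":
    wf = SpecWorkflow()
    wf.complete(Stage.CONSTITUTION, "All code ships with tests.")
    wf.complete(Stage.SPECIFICATION, "Users can export reports as CSV.")
    print(wf.next_stage().name)
```

The point of the gate is that an agent asked to jump straight to implementation gets a hard error instead of a plausible-looking guess.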
Stop building agent boilerplate. Use these templates.
Awesome LLM Apps is now the standard cookbook for LLM application development. It is not a curated list. Every template is built, tested, and maintained by the repo maintainers.
You can clone and run a working travel agent, earnings call analyst, insurance voice agent, or research agent in 3 commands. There are no broken requirements.txt files. There are no "left as an exercise for the reader" gaps. Every template has a full step-by-step tutorial.
All templates are provider-agnostic. Swap OpenAI for Llama, Qwen, or DeepSeek with a one-line change in a config file. Everything is Apache 2.0 licensed. You can ship anything you build from these templates commercially, subject only to the license's attribution and notice terms.
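A one-line provider swap usually comes down to a registry keyed by a config value. The following is a generic sketch of that pattern with stub back ends and no real SDK calls. `ProviderConfig`, `register`, and `build_chat` are hypothetical names, not the repo's actual code.

```python
from dataclasses import dataclass
from typing import Callable

ChatFn = Callable[[str], str]

# Maps a provider name to a factory that builds a chat function for a model.
_REGISTRY: dict[str, Callable[[str], ChatFn]] = {}


@dataclass(frozen=True)
class ProviderConfig:
    provider: str  # the one line you change to swap back ends
    model: str


def register(name: str) -> Callable[[Callable[[str], ChatFn]], Callable[[str], ChatFn]]:
    """Decorator that adds a provider factory to the registry."""
    def decorator(factory: Callable[[str], ChatFn]) -> Callable[[str], ChatFn]:
        _REGISTRY[name] = factory
        return factory
    return decorator


@register("stub-a")
def _stub_a(model: str) -> ChatFn:
    # A real factory would construct an SDK client here.
    return lambda prompt: f"[stub-a/{model}] {prompt}"


@register("stub-b")
def _stub_b(model: str) -> ChatFn:
    return lambda prompt: f"[stub-b/{model}] {prompt}"


def build_chat(cfg: ProviderConfig) -> ChatFn:
    """Look up the configured provider and return its chat function."""
    try:
        factory = _REGISTRY[cfg.provider]
    except KeyError:
        raise ValueError(f"unknown provider: {cfg.provider}") from None
    return factory(cfg.model)


if __name__ == "__main__":
    chat = build_chat(ProviderConfig(provider="stub-a", model="small"))
    print(chat("hello"))
```

Application code only ever sees a `ChatFn`, so changing the `provider` value is the whole migration as long as each back end honours the same signature.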
This repo will save you more time than any agent framework released this quarter.
MCP is not a toy anymore. Context Forge is the gateway you need.
Everyone is talking about the Model Context Protocol. No one is talking about how to run it at scale.
Context Forge from IBM is the first production-grade MCP proxy. It solves all the boring hard problems everyone ignores:
- Federate hundreds of MCP servers behind a single endpoint
- Centralized auth, rate limiting, and retry policies
- Automatic translation of existing REST and gRPC APIs to MCP tools
- Full OpenTelemetry tracing and token usage metrics
- SSRF protection
- Kubernetes deployment with proper autoscaling
If you are planning to deploy agents that use more than 2 tools, you need this. Stop pasting MCP server configs into every user's Claude config. Run one gateway.
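To show what "one gateway" means in practice, here is a toy in-process sketch of three of the listed responsibilities: federating several servers' tools behind one namespace, a per-client call budget, and retries on transient failures. It is not Context Forge's API. `Gateway`, `TransientError`, and the simple call counter (a stand-in for real time-based rate limiting) are all assumptions.

```python
from typing import Any, Callable

Tool = Callable[..., Any]


class TransientError(Exception):
    """Raised by a back end when a call may succeed on retry."""


class Gateway:
    def __init__(self, max_calls_per_client: int = 100, max_retries: int = 2) -> None:
        self._tools: dict[str, Tool] = {}
        self._calls: dict[str, int] = {}
        self.max_calls_per_client = max_calls_per_client
        self.max_retries = max_retries

    def register_server(self, server: str, tools: dict[str, Tool]) -> None:
        # Namespace tool names so two servers can both expose e.g. "search".
        for name, fn in tools.items():
            self._tools[f"{server}.{name}"] = fn

    def list_tools(self) -> list[str]:
        return sorted(self._tools)

    def call(self, client: str, tool: str, **kwargs: Any) -> Any:
        # Central policy: every client call passes through the same budget check.
        used = self._calls.get(client, 0)
        if used >= self.max_calls_per_client:
            raise PermissionError(f"call budget exceeded for {client}")
        self._calls[client] = used + 1
        fn = self._tools[tool]  # raises KeyError for unknown tools
        for attempt in range(self.max_retries + 1):
            try:
                return fn(**kwargs)
            except TransientError:
                if attempt == self.max_retries:
                    raise  # retries exhausted, surface the failure


if __name__ == "__main__":
    gw = Gateway()
    gw.register_server("docs", {"search": lambda q: f"doc hit for {q}"})
    gw.register_server("tickets", {"search": lambda q: f"ticket hit for {q}"})
    print(gw.list_tools())
    print(gw.call("alice", "docs.search", q="mcp"))
```

Clients see a single tool list and a single policy layer, and adding a back end is a registration call on the gateway rather than a change to every user's configuration.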
The unifying pattern across all of these
None of these tools are trying to build a new foundation model. None of them are claiming AGI. Every single one solves one specific, boring, real problem that engineers actually have.
That is the shift we are seeing right now. The era of demo toys is over. The hype cycle has crashed. The engineers who stayed are building tools that work. They are small. They are composable. They have clear tradeoffs. They don't lie about what they do.
What not to deploy
Do not run VibeVoice TTS. The weights are pulled and it will be misused. Do not use any agent framework that shipped in the last month. Do not run Whisper large-v3 anymore. Do not build your own RAG pipeline from scratch. There is no reason to.
None of these tools will make the front page of TechCrunch. All of them will make you more productive next week. That is the part that matters.