This is not a list of demo projects. Every release covered here has working code and published benchmarks, and teams are already deploying it this month.
We have passed the peak of foundation model mania. No one released a 1T-parameter model this month. No one claimed state of the art on MMLU. Every significant open source ML release shipped infrastructure, tooling, datasets, or agent runtimes. This is what production ML actually looks like.
The quiet shift in open source ML
For three years, every major open source release was a better general-purpose LLM. That era ended this quarter.
Teams are no longer building better models. They are building the layers that sit between your code and the model. Compression. Context routing. Agent steering. Benchmarks that actually measure failure modes. Datasets you can legally use.
None of this makes good Twitter threads. All of this is what you will be using at work for the next two years.
GPIC: The first usable permissive image training corpus
Stanford Vision Lab dropped the most important release this month and almost no one noticed.
GPIC is 100 million captioned images, 28 trillion total pixels, deduplicated, safety-filtered, and licensed for commercial use. No research-only restrictions. No non-commercial clauses. No ambiguous scraping provenance.
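To put the pixel count in perspective, the average image size follows from dividing the two figures above. This is a back-of-the-envelope check, not something from the release: it assumes the 28 trillion figure covers roughly 100 million images, and it treats images as square purely for illustration.

```python
import math

TOTAL_PIXELS = 28e12  # 28 trillion pixels, as quoted above
NUM_IMAGES = 100e6    # 100 million images, as quoted above

# Average number of pixels per image.
avg_pixels = TOTAL_PIXELS / NUM_IMAGES

# Side length of a square image with that many pixels.
# Assumption: real images have varying aspect ratios, so this is only a feel for scale.
avg_side = math.sqrt(avg_pixels)

print(f"{avg_pixels:,.0f} pixels per image on average")
print(f"roughly {avg_side:.0f} x {avg_side:.0f} if square")
```

That works out to about 280,000 pixels per image, or a little over 500 pixels on a side.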
Most open generative image models built in the last four years were trained on LAION or data derived from it. Companies shipping commercial image generation on that data have been operating with unresolved legal risk. GPIC fixes this.
The dataset is split cleanly into 100M train, 200K validation, and 1M test examples. Captions were generated with a state-of-the-art VLM, not scraped alt text. All files are hosted natively on Hugging Face. It was downloaded 34,118 times in the first 10 days.
Reference flow-matching baselines are provided. You can train a production-grade image generator on this dataset next week and never have to talk to a lawyer. This is the end of LAION.
Headroom: Context compression that actually works
Headroom is a context compression layer for agents. It sits between your agent and the LLM. It compresses tool outputs, logs, RAG chunks, files and conversation history. Same answers, fraction of the tokens.
This is not another prompt-trimming hack. Headroom shows no accuracy loss on GSM8K and improves scores on TruthfulQA by 3%. On real workloads it delivers 92% token reduction for code search and incident debugging, 73% for issue triage, and 47% for codebase exploration.
It works as a library, a drop-in proxy, or a wrapper for every popular coding agent. You can install it, wrap Claude Code, and cut your LLM bill by up to 80% today with zero code changes, depending on the workload.
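How a token reduction turns into a bill reduction depends on how much of your spend is context (input) tokens. The sketch below is a rough model, not a Headroom feature: it assumes compression only shrinks input tokens, and the 85% input share is a hypothetical figure you should replace with your own.

```python
def bill_reduction(input_reduction: float, input_share_of_bill: float) -> float:
    """Fraction of the total bill saved when only input tokens shrink.

    Assumption: output-token spend is unchanged by compression.
    """
    return input_reduction * input_share_of_bill


# Token reductions quoted in the text above.
workloads = {
    "code search": 0.92,
    "issue triage": 0.73,
    "codebase exploration": 0.47,
}

# Hypothetical: input tokens account for 85% of total spend.
INPUT_SHARE = 0.85

for name, reduction in workloads.items():
    saved = bill_reduction(reduction, INPUT_SHARE)
    print(f"{name}: {saved:.0%} of total bill saved")
```

Under these assumptions the savings track the workload closely, so measure your own input share before budgeting around a single headline number.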
Most importantly it is reversible. Original context is stored locally. The LLM can retrieve uncompressed content on demand if it needs it. No other compression tool does this.
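The reversibility idea can be sketched in a few lines. This is not Headroom's API or algorithm: `ReversibleStore`, its methods, and the stub format are hypothetical names, and plain truncation stands in for whatever compression the real tool applies. The point is only the shape of the pattern: keep the original locally, hand the model a short stub with a key, and let it ask for the full content.

```python
import hashlib


class ReversibleStore:
    """Hypothetical local store: keeps originals, hands out short stubs."""

    def __init__(self, max_chars: int = 200):
        self.max_chars = max_chars
        self._originals: dict[str, str] = {}

    def compress(self, text: str) -> str:
        """Return short text unchanged; otherwise a truncated stub with a retrieval key."""
        if len(text) <= self.max_chars:
            return text
        # Content hash doubles as the lookup key for the original.
        key = hashlib.sha256(text.encode("utf-8")).hexdigest()[:12]
        self._originals[key] = text
        head = text[: self.max_chars]
        dropped = len(text) - self.max_chars
        return f"{head}\n[truncated {dropped} chars; retrieve with key={key}]"

    def retrieve(self, key: str) -> str:
        """Return the full original content for a key issued by compress()."""
        return self._originals[key]


store = ReversibleStore(max_chars=50)
log = "ERROR timeout contacting db-7\n" * 100  # a repetitive tool output
stub = store.compress(log)

# The model (or a tool call on its behalf) reads the key from the stub.
key = stub.rsplit("key=", 1)[1].rstrip("]")
restored = store.retrieve(key)
print(len(log), "->", len(stub), "characters; lossless:", restored == log)
```

In an agent loop, `retrieve` would typically be exposed as a tool so the model can expand a stub only when the truncated view is not enough.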
This is the first agent infrastructure tool that is an unambiguous upgrade for every production deployment. You should be running this.
Holo 3.1: Local computer control agents are now real
Holo 3.1 is the first computer use agent that works well enough to deploy.
Previous-generation computer control agents worked great on benchmarks and failed completely in production. Holo 3.1 fixes distribution shift across browsers, desktops, and mobile. It scores 79.3% on AndroidWorld, up from 67% for the previous release. The 9B variant hits 72% on the same benchmark.
For the first time, quantized checkpoints are provided. NVFP4-quantized 35B checkpoints run at 1.74x the throughput of full-precision BF16 with only 2 points of performance loss on OSWorld. End-to-end step time is now 3.3 seconds.
You can run this agent fully locally on consumer hardware today. It will control your browser, your desktop, or your phone. No cloud API calls required.
This release crossed the threshold from research curiosity to usable tool. Teams are already building internal automation on top of it.
Surya 2: The new baseline for open source OCR
Surya 2 is a 650M parameter document intelligence model. It is now the default open source OCR choice for production systems.
It scores 83.3% on olmOCR-bench, the highest score for any model under 3B parameters. It runs at 5 pages per second on an RTX 5090. It supports 91 languages. It does layout analysis, reading order, and table recognition all through a single VLM.
All three tasks share the same inference backend. You run one server, not three separate models. Output includes proper HTML for tables and LaTeX for equations, with no post-processing required.
The previous open source OCR baseline was PaddleOCR. Surya 2 is better in every measurable dimension. It is also much easier to deploy. If you are running OCR in production, you should migrate.
Production Agentic RAG: The course that skips the hype
99% of RAG tutorials are garbage. They teach you to throw vectors into a database and call it production. This course teaches you how actual teams build RAG systems.
It is structured as a 7-week project building an academic paper research assistant. It does not start with vector search. It starts with infrastructure, data pipelines, and BM25 keyword search. Vectors are added in week 4 as an enhancement, not the foundation.
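For readers who have only used vector search, here is what the keyword baseline computes. This is a minimal sketch of Okapi BM25 scoring, not the course's code: the `BM25` class, the whitespace tokenizer, and the sample documents are illustrative, and `k1=1.2`, `b=0.75` are common defaults rather than values taken from the course.

```python
import math
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Naive lowercase whitespace tokenizer (real systems use an analyzer)."""
    return text.lower().split()


class BM25:
    def __init__(self, docs: list[str], k1: float = 1.2, b: float = 0.75):
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_freqs = [Counter(toks) for toks in self.doc_tokens]
        self.avgdl = sum(len(t) for t in self.doc_tokens) / len(docs)
        # Document frequency: number of documents containing each term.
        df = Counter(term for toks in self.doc_tokens for term in set(toks))
        n = len(docs)
        # Smoothed IDF that stays positive for very common terms.
        self.idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1) for t, c in df.items()}

    def score(self, query: str, i: int) -> float:
        """BM25 score of document i for the query."""
        freqs, dl = self.doc_freqs[i], len(self.doc_tokens[i])
        total = 0.0
        for term in tokenize(query):
            f = freqs.get(term, 0)
            if f == 0:
                continue
            # Term frequency saturates via k1; b controls length normalization.
            norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
            total += self.idf[term] * f * (self.k1 + 1) / norm
        return total

    def search(self, query: str, top_k: int = 3) -> list[tuple[int, float]]:
        """Return (doc index, score) pairs for matching documents, best first."""
        scored = [(i, self.score(query, i)) for i in range(len(self.doc_tokens))]
        scored = [s for s in scored if s[1] > 0]
        return sorted(scored, key=lambda s: s[1], reverse=True)[:top_k]


papers = [
    "attention is all you need transformer architecture",
    "bm25 and probabilistic relevance framework for retrieval",
    "dense passage retrieval for open domain question answering",
]
index = BM25(papers)
results = index.search("probabilistic retrieval")
for doc_id, score in results:
    print(f"{score:.3f}  {papers[doc_id]}")
```

The document matching both query terms ranks first and the one matching neither is dropped, with no embeddings or model calls involved. That cheap, inspectable behavior is the argument for starting with keyword search and layering vectors on later.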
Every week has working production-grade code. No notebooks that only run on the author's machine. The stack uses FastAPI, PostgreSQL, OpenSearch, Airflow, Redis, and Langfuse. This is exactly the stack every mid-sized company is deploying right now.
If you want to learn to build production RAG, this is the only material you need. Ignore everything else.
Open spatial reasoning: The benchmark that breaks every VLM
ReasonCore released a small, brutal benchmark for spatial reasoning. Every frontier VLM fails it.
The benchmark presents monocular driving images with bounding boxes and asks simple 3D reasoning questions. Every existing VLM gets ~50% accuracy, same as random chance. They all rely on flat image shortcuts: lower in the frame is closer, bigger box is nearer, left in the image is to my left. None actually reason about 3D space.
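The three shortcuts are easy to write down as code, which makes it clear why they are not 3D reasoning. The sketch below is illustrative only: the `Box` fields, function names, and example coordinates are hypothetical and do not reflect the benchmark's data format.

```python
from dataclasses import dataclass


@dataclass
class Box:
    """Axis-aligned 2D box in pixel coordinates; y grows downward."""
    x_min: float
    y_min: float
    x_max: float
    y_max: float

    @property
    def area(self) -> float:
        return (self.x_max - self.x_min) * (self.y_max - self.y_min)


def closer_by_position(a: Box, b: Box) -> str:
    """Shortcut 1: the box whose bottom edge is lower in the frame is 'closer'."""
    return "a" if a.y_max > b.y_max else "b"


def closer_by_size(a: Box, b: Box) -> str:
    """Shortcut 2: the bigger box is 'nearer'."""
    return "a" if a.area > b.area else "b"


def left_of_viewer(box: Box, image_width: float) -> bool:
    """Shortcut 3: box center in the left half of the image means 'to my left'."""
    return (box.x_min + box.x_max) / 2 < image_width / 2


# Hypothetical scene: a large truck high in the frame, a small cone low in the frame.
truck = Box(400, 100, 700, 300)
cone = Box(100, 420, 140, 480)

print("position shortcut says closer:", closer_by_position(truck, cone))
print("size shortcut says closer:", closer_by_size(truck, cone))
print("cone to my left:", left_of_viewer(cone, image_width=800))
```

On this scene the position shortcut picks the cone and the size shortcut picks the truck. Neither uses camera geometry or object scale, so when they disagree a model leaning on them has nothing to break the tie.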
This is the single most informative VLM benchmark released this year. It does not measure how well models can regurgitate facts. It measures what they actually understand about the physical world. Right now the answer is nothing.
Test your favourite VLM on this dataset. It will disappoint you.
Specialized agent frameworks
Three very different agent frameworks shipped stable releases this month. None of them is a general-purpose agent runtime. All of them solve one specific problem very well.
TradingAgents implements a full multi-agent trading firm. It has separate analysts, researchers, traders, risk managers, and portfolio managers. Agents debate positions. They remember previous decisions and learn from mistakes. It supports every major LLM provider and works for every global market. This is not a trading bot. It is a research framework for testing multi-agent decision making.
Deep Eye is an AI-powered penetration testing tool. It orchestrates 10 different LLM providers to generate payloads, scan for 45+ vulnerability types, filter false positives, and generate compliance-mapped reports. It bypasses Cloudflare. It will not replace human penetration testers. It will replace 80% of the boring, repetitive scanning work.
Open-LLM-VTuber passed 100,000 installs. It is a fully offline, voice-interactive Live2D companion. It supports visual perception, voice interruption, and touch feedback. It works on every operating system. This is currently the most widely deployed end-user agent application in the world. No one saw that coming.
AWS AI-DLC: Standardized agent steering files
AWS released AI-DLC, a standardized workflow for agent-assisted development. It is a set of rule files that work identically across Claude Code, Cursor, Cline, Amazon Q, Copilot, and Kiro.
You drop one set of files into your repository and every coding agent follows the same process, quality standards, and guardrails. No more writing separate instructions for every agent.
This is the first attempt at a standard interface for steering agents. Every company running agent-assisted development will adopt something like this. Right now this is the best implementation available.
What changed this month
We crossed an invisible line. Open source ML is no longer about models. It is now about everything else.
None of the hard problems left are model problems. They are context management problems. Agent steering problems. Licensing problems. Benchmarking problems. Deployment problems.
This is good news. Most ML engineers do not train foundation models. Most ML engineers build systems. For the first time in three years the open source community is building the tools you actually need.
What to watch next
Over the next 90 days expect to see:
- Every major image generator retraining on GPIC
- Headroom integrated into every major agent framework
- Holo 3.1 derivatives shipping in end-user products
- Surya 2 replacing every other open source OCR implementation
None of this will make headlines. All of this will change how you build ML systems.