What Actually Matters On Hugging Face Right Now (June 2026)

#hugging-face #ml-datasets #fine-tuning #small-language-models #agent-ai

The quiet shift on Hugging Face right now

Nobody is posting benchmark leaderboards anymore. For three years, every major Hub release led with a big table of MMLU and GSM8K scores. That stopped two weeks ago. Right now the top downloads are not models with 90% on some synthetic test. They are datasets: working, cleaned, validated datasets that people are actually fine-tuning on. This is not an accident. We have hit the point where model architecture stopped mattering, and data quality is the only remaining lever that reliably moves the needle.

This article breaks down the releases that actually matter from the last two weeks. No press releases. No marketing. Just things you can pull and use in production today.

MiniCPM5 and the SFT dataset that changes small models

OpenBMB dropped two things last week. The first is MiniCPM5-1B, a 1B-parameter dense transformer that is now the clear SOTA for on-device models. More importantly, they dropped the exact dataset they used to train it: UltraData-SFT-2605. That almost never happens. Everyone releases models. Almost nobody releases their actual production SFT data or shows you what made the model good.

UltraData has 15 million samples and went through six filtering steps. Most importantly, the data was validated by actually training models on it, not by filtering with a judge LLM. They ran 20-billion-token test fine-tunes, measured the actual capability gain, and threw out any data slice that did not move the needle.

They also ran full benchmark decontamination, and not the lazy string matching everyone uses. They trained a model on each slice, ran it against every public benchmark, and threw out any slice that produced inflated scores. This is the only public dataset we know of that does this properly, so you should not see fake benchmark gains when you train on it.

This is how good SFT data is made, and almost nobody does it. Everyone else just scrapes ShareGPT and runs it through GPT-4o for scoring. You can load the whole thing in one line. It is Apache 2.0 licensed. It already has 8,096 downloads after 7 days.
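The slice-level validation described above can be sketched roughly as follows. This is an illustration only, not OpenBMB's pipeline: `SliceResult`, `select_slices`, the two score fields, and both thresholds are hypothetical names and values. It assumes each slice has already been through a test fine-tune, and that "inflated" means the public benchmark gain is much larger than the gain on private held-out tasks.

```python
from dataclasses import dataclass


@dataclass
class SliceResult:
    """Outcome of one test fine-tune on a single data slice (hypothetical schema)."""
    name: str
    heldout_gain: float    # score change on private held-out tasks vs. the base model
    benchmark_gain: float  # score change on public benchmarks vs. the base model


def select_slices(results, min_gain=0.5, max_inflation=2.0):
    """Keep slices that help on held-out tasks without suspicious benchmark jumps.

    Assumption: a public benchmark gain that exceeds the held-out gain by more
    than `max_inflation` points is treated as a sign of contamination.
    """
    kept = []
    for r in results:
        if r.heldout_gain < min_gain:
            continue  # slice did not move the needle
        if r.benchmark_gain - r.heldout_gain > max_inflation:
            continue  # benchmark score looks inflated relative to the real gain
        kept.append(r.name)
    return kept


# Made-up numbers, purely to exercise the two rejection rules.
results = [
    SliceResult("math_think", heldout_gain=1.8, benchmark_gain=2.1),
    SliceResult("chat_no_think", heldout_gain=0.1, benchmark_gain=0.2),
    SliceResult("code_think", heldout_gain=1.2, benchmark_gain=6.5),
]
print(select_slices(results))  # ['math_think']
```

The expensive part in practice is producing the two gain numbers, since each one costs a training run. The selection rule itself stays this simple.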

Deep thinking vs non-thinking data: the split nobody talks about

UltraData has one other feature that will become standard within six months. Every domain comes in two separate splits: think and no_think. Non-thinking data trains the model to give short, fast, correct answers. No reasoning chains. No verbose preamble. Just the answer. This is what 90% of end users actually want 90% of the time. Deep-thinking data trains the model to do slow, multi-step verification, decomposition, and error checking.

OpenBMB did not mix them. They did not weight them. They released them as separate splits, and you choose the ratio that fits your use case. This is the single most important fine-tuning insight of the last year. Almost every bad model behaviour comes from training on a bad mix of these two data types. Models that ramble? Too much thinking data. Models that hallucinate answers? Too little. Before this release, everyone just dumped everything into one bucket. Now you have an explicit control.
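Choosing a ratio between the two splits can be as simple as the sketch below. The rows are toy stand-ins rather than real UltraData samples, and `mix_splits` and its parameters are hypothetical names for this example.

```python
import random


def mix_splits(think, no_think, think_ratio, total, seed=0):
    """Sample `total` examples with a `think_ratio` share drawn from `think`.

    Sampling is without replacement, so each split must be large enough.
    """
    if not 0.0 <= think_ratio <= 1.0:
        raise ValueError("think_ratio must be between 0 and 1")
    n_think = round(total * think_ratio)
    n_no_think = total - n_think
    if n_think > len(think) or n_no_think > len(no_think):
        raise ValueError("not enough examples in one of the splits")
    rng = random.Random(seed)
    mixed = rng.sample(think, n_think) + rng.sample(no_think, n_no_think)
    rng.shuffle(mixed)  # avoid training on all of one type first
    return mixed


# Toy stand-ins for the two splits; real rows would come from the dataset.
think = [{"mode": "think", "id": i} for i in range(100)]
no_think = [{"mode": "no_think", "id": i} for i in range(100)]

mixed = mix_splits(think, no_think, think_ratio=0.2, total=50)
n_think_rows = sum(row["mode"] == "think" for row in mixed)
print(n_think_rows, "of", len(mixed), "rows are think")  # 10 of 50 rows are think
```

A fixed seed keeps the mix reproducible, which matters when you compare two ratios against each other.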

Agent traces are now the most valuable datasets on the Hub

Three separate agent trace datasets dropped in the last 10 days. Combined, they have over 2.2 million rows. AgentTrove is the largest at 1.7 million rows, 4x bigger than the previous largest public agent corpus. Every row is a full agent trajectory, successful or failed, with tool calls, environment output, and final outcome. Then there are the Qwen 3.7 Pi traces: 47 full agent sessions captured directly from the Pi coding agent runtime. This is not synthetic data. These are real traces from a working agent. Finally, there is the Claude Opus Trace Inversion dataset: 9,000 examples where the authors took Claude Opus final answers and reverse-engineered the reasoning chain that produced each one.

All three agent datasets work natively with Unsloth and TRL. Each readme has working example code that will start a fine-tune in 30 seconds. You do not need to write any parsing code or clean anything. It just works. None of these datasets existed 30 days ago. Before this, if you wanted to fine-tune an agent, you had to generate your own traces. Now you have enough data to train a production-grade agent on a single 4090 over a weekend.
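To make the shape of a trajectory row concrete, here is a small sketch of the three parts described above: tool calls, environment output, and final outcome. The `Step` and `Trajectory` classes and their field names are hypothetical, not the schema of any of these datasets. Each readme has its own loading code.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class Step:
    tool: str        # name of the tool the agent called
    arguments: dict  # arguments passed to that tool
    output: str      # what the environment returned


@dataclass
class Trajectory:
    task: str
    steps: list = field(default_factory=list)
    success: bool = False  # final outcome of the run


def successful_only(trajectories):
    """Keep only runs whose final outcome was a success."""
    return [t for t in trajectories if t.success]


def tool_usage(trajectories):
    """Count how often each tool appears across all runs."""
    return Counter(step.tool for t in trajectories for step in t.steps)


# Two invented runs: one that succeeds and one that fails.
runs = [
    Trajectory(
        task="fix the failing unit test",
        steps=[
            Step("run_tests", {"path": "tests/"}, "1 failed"),
            Step("edit_file", {"path": "app.py"}, "ok"),
            Step("run_tests", {"path": "tests/"}, "all passed"),
        ],
        success=True,
    ),
    Trajectory(
        task="upgrade the dependency",
        steps=[Step("run_shell", {"cmd": "pip install -U example"}, "error")],
        success=False,
    ),
]

print(len(successful_only(runs)), "successful run(s)")
print(tool_usage(runs).most_common(1))
```

Whether to drop failed runs is a design choice. They can be useful as negative examples, so filtering on outcome is only one option.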

The math dataset arms race

For a long time everyone used GSM8K. Then MATH. Then AIME. That is over. amphora/ResearchMath-14k dropped last week. It has 14,000 actual research mathematics problems. Not textbook exercises. Not competition problems. Problems posed as open in arXiv papers and workshop problem lists. Every entry has a full taxonomy, status, citations, and original source links. 31% are confirmed open. 22% are confirmed solved. The status of the rest is unknown.

This is not a benchmark. This is training data. You do not test on this dataset. You train on it. Models fine-tuned on this dataset do not just get better at math. They get better at knowing when they do not know the answer. They stop hallucinating proofs. They stop making up lemmas. This is the first dataset that actually trains intellectual honesty.
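For a sense of scale, the status percentages above translate into rough row counts as follows. This assumes the 14,000 figure and both percentages are exact, which the dataset card may not bear out.

```python
total = 14_000
shares = {"open": 31, "solved": 22}  # percentages quoted above
shares["unknown"] = 100 - sum(shares.values())

# Integer arithmetic avoids floating-point rounding surprises.
counts = {status: total * pct // 100 for status, pct in shares.items()}
print(counts)  # {'open': 4340, 'solved': 3080, 'unknown': 6580}
```

Under those assumptions, close to half of the entries carry an unknown status.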

What Nvidia is actually shipping (not announcing)

Nvidia did not put out a press release. They just uploaded LocateAnything-3B to the Hub. It is a 3B-parameter model that can find any object in any image. Any object. No fine-tuning. No prompt engineering. You type "find the broken bolt" and it draws a bounding box. It runs at 120 fps on an RTX 4090. It runs at 18 fps on a Raspberry Pi 5. This is the best computer vision model released this year, and almost nobody has noticed it yet. There is no blog post. No benchmark table. Just the model weights and a working demo. This is how Nvidia ships production software now. They do not announce things. They just put them on Hugging Face.

The death of generic pre-training data

FineWeb was the standard pre-training dataset for 18 months. Everyone used it. It was good enough. Last week OpenBMB uploaded Ultra-FineWeb-L3. It is 1% the size of FineWeb, and it will outperform FineWeb on every downstream task. Every single document in this dataset was filtered by training a model on it and measuring loss reduction. Documents that did not reduce model loss were thrown out. No heuristic filtering. No LLM scoring. Actual training signal.

This is the end of the big crawl era. Throwing more tokens at a model stopped working six months ago. Now we are throwing bad tokens away. The best pre-training dataset in the world right now is 300 million tokens. Not 30 trillion. That is the shift nobody is talking about.
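The idea of filtering by loss reduction can be shown on a toy scale. The sketch below swaps the real training run for an add-one-smoothed unigram model, so it illustrates the selection rule only and is not OpenBMB's method. The function names, the example documents, and the `min_reduction` threshold are all assumptions.

```python
import math
from collections import Counter


def heldout_loss(train_tokens, heldout_tokens, vocab):
    """Average negative log-likelihood of held-out tokens under an
    add-one-smoothed unigram model fit on `train_tokens`."""
    counts = Counter(train_tokens)
    denom = len(train_tokens) + len(vocab)
    total = sum(math.log((counts[t] + 1) / denom) for t in heldout_tokens)
    return -total / len(heldout_tokens)


def filter_by_loss_reduction(base_docs, candidates, heldout_docs, min_reduction=0.0):
    """Keep candidates whose addition to the base corpus lowers held-out loss."""
    base_tokens = [t for d in base_docs for t in d.split()]
    heldout_tokens = [t for d in heldout_docs for t in d.split()]
    # A fixed vocabulary keeps the losses comparable across candidates.
    vocab = set(base_tokens) | set(heldout_tokens)
    for doc in candidates:
        vocab |= set(doc.split())

    base_loss = heldout_loss(base_tokens, heldout_tokens, vocab)
    kept = []
    for doc in candidates:
        loss = heldout_loss(base_tokens + doc.split(), heldout_tokens, vocab)
        if base_loss - loss > min_reduction:
            kept.append(doc)
    return kept


base_docs = ["the model reads data", "the data trains the model"]
heldout_docs = ["the model trains on clean data"]
candidates = [
    "clean data trains the model",         # on-topic: lowers held-out loss
    "click here to win a free prize now",  # off-topic: raises held-out loss
]

kept = filter_by_loss_reduction(base_docs, candidates, heldout_docs)
print(kept)  # ['clean data trains the model']
```

The real version would replace `heldout_loss` with a short training run plus evaluation, which is why this style of filtering is costly and the resulting datasets are small.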

Production models you can deploy this week

These are the models released in the last two weeks that you should actually deploy:

  • MiniCPM5-1B: deploy this on every edge device. It beats every other 1B and most 3B models.
  • LocateAnything-3B: replace every object detection model you are running right now.
  • LongCat-Video-Avatar 1.5: Meituan's talking head model. Runs in real time on consumer GPUs. Better quality than anything from OpenAI or Google.
  • LiquidAI LFM2.5 8B: the best general purpose 8B model right now. No alignment garbage. Fast.

The unspoken rules of good Hub releases right now

There is a new standard for good releases on the Hub now. You can spot the good ones immediately:

  1. They release the dataset before the model.
  2. They show you exactly what filtering pipeline they used.
  3. They do not lead with benchmark scores.
  4. They include working one line load code.
  5. They do not ask you to email them for weights.

If a release does not do all five things, ignore it. It is marketing garbage. This standard did not exist three months ago. It is universal now. The community has voted with their downloads.

What this means for the field

We have left the era of model architecture innovation. That part is done. Every single important release on Hugging Face right now is data. Every single gain comes from better filtering, better curation, better validation. Nobody is inventing a new transformer variant that will give you 20% better performance. But you can get that 20% tomorrow by fine-tuning on one of the datasets listed here. This is good news. ML is no longer the domain of people who write CUDA kernels. It is now the domain of people who curate good datasets. The barrier to entry just dropped by an order of magnitude. And almost nobody has noticed yet.