Six LLM Training & Scaling Papers That Changed The Rules This Month

If you run LLM training clusters, you have probably spent the last two weeks re-running spreadsheets, throwing out old tuning scripts, and arguing about whether that weird loss rollover you saw last quarter was really a bug after all. Six papers dropped on arXiv in a 72-hour window that rewrite almost every foundational rule we have been operating on for the last four years.

Shannon scaling kills the monotonic power law

For years, every LLM team operated under the assumption that loss follows a clean power law against compute, parameters, and tokens. Everyone saw the exceptions. Everyone ignored them and wrote them off as bad hyperparameters, bad data, or implementation bugs.

This paper does not tweak existing scaling law exponents. It throws the entire model out. The authors model training as transmission over a noisy channel: parameters are channel bandwidth, and training tokens are signal power. There is a hard capacity limit. Once you cross it, adding more parameters or more tokens makes performance worse. Full stop.
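
For reference, the classical Shannon-Hartley capacity of a band-limited channel with additive Gaussian noise is written out here. Under the analogy above, B would stand in for parameter count and S for training tokens. What plays the role of the noise power N, and whether the paper uses this exact form, are assumptions made only for illustration.

```latex
C = B \log_2\!\left(1 + \frac{S}{N}\right)
```

Note that C on its own grows with both B and S, so the degradation past the limit has to come from additional structure in the paper's model that is not reproduced here.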

This is not theoretical. They validated this across Pythia and OLMo2 from 1B to 12B parameters. Fitted only on runs up to 6.9B parameters and 180B tokens, their model predicted the 12B model loss out to 307B tokens with R²=0.847. Every existing monotonic scaling law failed completely here, predicting continuing improvement while actual loss started climbing.
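
To make the R² comparison concrete, here is a minimal sketch of how an extrapolation could be scored. The loss values are invented for illustration and do not come from the paper; `r_squared` is a hypothetical helper name.

```python
def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_res = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    ss_tot = sum((y - mean) ** 2 for y in observed)
    return 1.0 - ss_res / ss_tot


# Hypothetical held-out losses that start climbing late in training.
observed = [2.0, 2.1, 2.2]
# Hypothetical monotonic-law forecast that keeps predicting improvement.
monotonic_forecast = [2.0, 1.9, 1.8]

# A forecast that moves in the wrong direction scores below zero.
print(r_squared(observed, monotonic_forecast))
```

An R² below zero means the forecast is worse than simply predicting the mean of the held-out losses, which is one way a monotonic law can "fail completely" on a run whose loss turns upward.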

Catastrophic overtraining is not an accident. It is not something you fix with weight decay. It is the fundamental expected behavior once you exceed the SNR limit of your training setup.

If you are currently planning a 100B parameter training run, stop. Read this paper first. You are almost certainly planning to run straight over the cliff.

Complete-muE ends MoE hyperparameter hell

Every team that has moved from dense transformers to MoE has hit the same wall. All your carefully tuned hyperparameters break completely. Learning rate, weight decay, initialization scales, everything. You end up running 20 expensive ablations for every new expert count, just to get back to baseline.

Complete-muE fixes this.

Existing transfer frameworks like μP and SDE only handled one variable changing at a time. MoE changes two at once: effective width, and the number of tokens seen per expert per step. Complete-muE adds two correction bridges: one to map dense FFN to dense MoE via normalized router scaling, and a second to adjust for expert activation sparsity. The residual drift after correction is bounded and consistent across all configurations.
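
The two quantities that change at once can be computed directly from the layer shape. The sketch below assumes perfectly balanced routing and uses hypothetical names and sizes; it illustrates the bookkeeping only and is not the paper's correction.

```python
def moe_shape_summary(batch_tokens, num_experts, top_k, expert_hidden):
    """Return (tokens seen per expert per step, active FFN width per token).

    Assumes perfectly balanced routing, which real routers only approximate.
    """
    tokens_per_expert = batch_tokens * top_k / num_experts
    active_width = top_k * expert_hidden
    return tokens_per_expert, active_width


# Hypothetical dense reference: one "expert" that sees every token.
dense = moe_shape_summary(batch_tokens=1_048_576, num_experts=1, top_k=1,
                          expert_hidden=4096)
# Hypothetical MoE layer: 64 experts, top-2 routing, same expert size.
moe = moe_shape_summary(batch_tokens=1_048_576, num_experts=64, top_k=2,
                        expert_hidden=4096)
print(dense, moe)
```

In this example each expert sees 32 times fewer tokens per step than the dense FFN while the active width per token doubles, so a rule that corrects for only one of the two shifts would leave the other unaccounted for.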

The practical result is ridiculous. Tune hyperparameters once on a tiny dense reference model. Transfer them unchanged to any MoE configuration. Any number of total experts. Any top-k activation. Any group balance scheme. The observed drift in optimal hyperparameters is less than 10% across all tested configurations.

This is not an incremental improvement. This removes the single largest operational cost of running MoE training. You can stop burning 30% of your cluster compute on hyperparameter sweeps this week.

Muon optimizer has a proper theoretical foundation

Muon appeared out of nowhere three months ago, beat AdamW on every large transformer run anyone tried, and nobody understood why it worked. Everyone was using it anyway.

This paper gives it a formal grounding. It turns out regularized Muon is not just a clever heuristic. It is a mirror descent step on the Fenchel dual of the nuclear norm. Momentum is not an add-on; it is the dual coordinate in this formulation. The entire update rule is the discrete-time step of a damped Hamiltonian flow over parameter distributions.
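
For readers who have not looked inside the optimizer, here is a simplified sketch of a Muon-style step: accumulate momentum, replace the momentum matrix with its orthogonal factor U V^T, and step along it. That factor maximizes the inner product with the momentum over all matrices of spectral norm at most one, and the maximum equals the nuclear norm, which is where the nuclear norm enters. The sketch uses the textbook cubic Newton-Schulz iteration and plain heavy-ball momentum; production implementations use tuned polynomial coefficients and other details, and the paper's regularized variant is not reproduced here. All names and default values are illustrative.

```python
def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def orthogonalize(g, steps=30):
    """Approximate U V^T for g = U S V^T via cubic Newton-Schulz.

    Assumes g is nonzero.
    """
    fro = sum(v * v for row in g for v in row) ** 0.5
    x = [[v / fro for v in row] for row in g]  # singular values now at most 1
    for _ in range(steps):
        xxtx = matmul(matmul(x, transpose(x)), x)
        # Each singular value s maps to 1.5*s - 0.5*s**3, pushing it toward 1.
        x = [[1.5 * p - 0.5 * q for p, q in zip(rp, rq)]
             for rp, rq in zip(x, xxtx)]
    return x


def muon_like_step(w, grad, buf, lr=0.02, beta=0.95):
    """One simplified step: momentum, orthogonalize, then update."""
    buf = [[beta * m + g for m, g in zip(rm, rg)] for rm, rg in zip(buf, grad)]
    direction = orthogonalize(buf)
    w = [[p - lr * u for p, u in zip(rp, ru)] for rp, ru in zip(w, direction)]
    return w, buf


# Hypothetical 2x3 weight matrix and gradient.
w0 = [[0.0] * 3 for _ in range(2)]
buf0 = [[0.0] * 3 for _ in range(2)]
grad = [[2.0, 1.0, 0.0], [1.0, 3.0, 1.0]]
w1, buf1 = muon_like_step(w0, grad, buf0)
```

Because the update direction has all singular values equal to one, the step size no longer depends on the scale of the gradient, only on `lr`.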

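Most importantly, they prove exponential convergence under standard curvature assumptions. This is not just empirical performance. It is a convergence guarantee for matrix-valued parameters, which covers the bulk of transformer weights (attention and MLP projections), though not vector parameters such as biases and normalization gains. A convergence rate for Muon is not by itself a proof that it beats AdamW, but it gives the empirical wins a theoretical footing.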

They also extend the formulation directly to MoE expert blocks. There will be no separate MoE optimizer. The same rule works unchanged.

You do not need a strong teacher for distillation

Everyone knows how distillation works. You train a big good teacher. You distill its knowledge into a smaller faster student. This has been received wisdom for five years.

It is wrong.

This paper ran controlled distillation experiments across every combination of teacher and student size from 1B to 16B parameters, and every training token count from 10B to 300B.

They found three consistent results:

Even small, undertrained teachers improve larger students. You can train a 1B model for 20B tokens, distill it into a 7B model, and get better downstream performance than training the 7B model from scratch on the same total compute.
There is a hard saturation point. Making the teacher larger or training it longer past this point gives zero additional distillation gain. Often it makes results worse.
Distillation almost never improves in-domain perplexity. It almost always improves out-of-distribution generalization.
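
The "same total compute" comparison in the first result can be sketched with the common approximations of 6 FLOPs per parameter per training token and 2 FLOPs per parameter per forward-pass token. The 140B-token baseline below is a hypothetical figure, and the sketch assumes the teacher runs a forward pass on every student token rather than having its outputs cached.

```python
def train_flops(params, tokens):
    """Approximate training cost: 6 FLOPs per parameter per token."""
    return 6 * params * tokens


def student_tokens_under_budget(budget, teacher_params, teacher_tokens,
                                student_params):
    """Tokens the student can train on once the teacher has been paid for."""
    remaining = budget - train_flops(teacher_params, teacher_tokens)
    # Per student token: student training plus one teacher forward pass.
    per_token = 6 * student_params + 2 * teacher_params
    return remaining / per_token


# Hypothetical baseline: a 7B model trained from scratch on 140B tokens.
budget = train_flops(7e9, 140e9)
# Distillation setup from the text: 1B teacher trained for 20B tokens.
distill_tokens = student_tokens_under_budget(budget, 1e9, 20e9, 7e9)
print(f"{distill_tokens / 1e9:.1f}B student tokens under the same budget")
```

Under these assumptions the distilled student sees roughly 131B tokens instead of 140B, so the claim is that the teacher signal more than pays for the tokens it displaces.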

Stop wasting compute training 70B teacher models for distillation. You are throwing away money.

Approximate attention hits the I/O lower bound

FlashAttention was the biggest practical advance in LLM inference for three years. Everyone assumed we were already near the theoretical limit for attention I/O cost.

We were not.

This paper proves that the Ω(n²) I/O cost that FlashAttention hits is not fundamental. It is an artifact of exact softmax attention. For approximate attention with bounded error, the theoretical lower bound is Ω(nd), linear in sequence length.

They built an algorithm that comes within a constant factor of this bound. For sequence length 131072, this algorithm does 17x fewer DRAM transfers than FlashAttention 3.
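
To see how differently the two bounds grow, the sketch below evaluates just the n² and n·d terms. The head dimension of 128 is an assumption, and constants, on-chip memory size, and approximation error are all ignored, so the ratio it prints is not expected to match the measured 17x figure.

```python
def io_terms(n, d):
    """Return the bare n^2 and n*d terms, with all constants dropped."""
    return n * n, n * d


n, d = 131072, 128  # d is a hypothetical head dimension
quadratic, linear = io_terms(n, d)
print(quadratic // linear)  # ratio of the two terms, which equals n // d

# Doubling the sequence length quadruples one term and doubles the other.
quadratic_2n, linear_2n = io_terms(2 * n, d)
print(quadratic_2n // quadratic, linear_2n // linear)
```

The gap between the asymptotic ratio and the reported speedup is a reminder that the constant factor in "within a constant factor of this bound" matters in practice.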

This is not a minor optimization. This removes attention as the I/O bottleneck for long-context inference. The entire long-context scaling equation just changed.

Preisach attention is not just another variant

Most new attention proposals are minor tweaks that give 5% faster inference and break half the downstream tasks. Preisach Attention is different.

It throws out softmax entirely. It replaces it with a hysteretic relay operator that only tracks local extrema in the sequence. It has no positional embeddings. It does not care about token spacing.
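
To give a feel for what a hysteretic relay does, here is a sketch of the classical non-ideal relay from the Preisach hysteresis model. It is not the paper's attention layer; the thresholds and sequences are made up for illustration.

```python
def relay(signal, alpha, beta, state=-1):
    """Classical relay: switch to +1 at or above beta, to -1 at or below
    alpha, and otherwise hold the previous state."""
    states = []
    for x in signal:
        if x >= beta:
            state = 1
        elif x <= alpha:
            state = -1
        states.append(state)
    return states


def preisach_output(signal, thresholds, weights):
    """Weighted sum of the final states of several relays."""
    return sum(w * relay(signal, a, b)[-1]
               for (a, b), w in zip(thresholds, weights))


thresholds = [(0.2, 0.8), (0.4, 0.6)]  # hypothetical (alpha, beta) pairs
weights = [1.0, 0.5]                   # hypothetical relay weights

sparse = [0.0, 1.0, 0.5]                 # only the turning points
dense = [0.0, 0.3, 0.7, 1.0, 0.9, 0.5]   # same turning points, more samples
reordered = [1.0, 0.0, 0.5]              # same values, different history

print(preisach_output(sparse, thresholds, weights))
print(preisach_output(dense, thresholds, weights))
print(preisach_output(reordered, thresholds, weights))
```

The first two sequences give the same output because only the turning points matter, which mirrors the indifference to token spacing. The third gives a different output even though it ends on the same value, which is the memory effect that hysteresis provides.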

According to the paper, it is Turing complete in one layer, whereas standard transformers require O(log n) layers for the same result. It can compute running range statistics over arbitrary-length sequences in O(n log n) total time.

It cannot do random access retrieval. That is not a bug. That is the tradeoff. For tasks that require long episodic memory, not exact positional lookup, this will beat standard attention by an order of magnitude.

Closing observations

None of these papers are incremental. None are marketing for a model release. Every single one of them changes operational decisions that teams are making right now.

This is what progress looks like. Not new benchmark leaderboards. Not bigger model numbers. Papers that give you formal rules that work, that let you stop guessing, that let you stop wasting millions of dollars on things that do not work.

If you only have time to read one, read Complete-muE. It will save you more money next month than anything else you read this year. If you only have time to read the abstracts, read Shannon Scaling. It will change how you think about every training run you plan from here on.

We have spent half a decade building systems on rules that were only ever rough approximations. For the first time, we are starting to get actual theory that matches what people observe in production. That is a very good sign.
