Skip to content

Four Unsolved Transformer Problems That Got Real Progress This Month

#transformer-architecture #llm-research #mode-collapse #multi-agent-systems #multi-turn-conversation

None of these papers got tweeted. None have pretty demo videos. All four solve problems every LLM engineer has fought with at 2am.

Last week four papers landed on arXiv within 36 hours of each other. None came from OpenAI, DeepMind or Anthropic. None had press releases. Together they close gaps that have plagued production LLMs for three years.

This is not incremental benchmark wanking. Every one of these results can be dropped into an existing training pipeline within a month.

The quiet progress no one is talking about

We have spent three years scaling transformers while almost completely ignoring their consistent, repeatable failure modes. Everyone knew these bugs existed. No one had good explanations, and no one had good fixes.

That changed last week.

None of the work described here requires larger models, more training data, or order of magnitude more compute. Every improvement comes from correcting bad implicit assumptions that went unexamined for half a decade.

This is the good stuff. This is the research that actually changes what you can build.

Self anchored drift is why your chatbot lies about things you just told it

Everyone has seen this failure. You give the model three facts across three separate turns. At the end you ask it to combine them. It gets the answer wrong.

If you paste all three facts into a single new prompt, it gets it right 100% of the time. Same information, same model, completely different output.

This is not context window failure. This is not RAG failure. This is self anchored drift.

Every response the model generates leaves a trace. When it responds to partial information, it makes tiny unstated assumptions. Those assumptions get baked into the conversation history. By the final turn, the model is no longer reasoning from your evidence. It is reasoning from its own earlier guesses.

Until this paper there was no named mechanism for this effect. There were only bad workarounds: reset context every turn, re-inject all facts every message, force the model to restate all evidence before answering. All of them waste tokens. None of them fix the root problem.

Canonical context distillation fixes multi-turn consistency

The CCOPD method is almost offensively simple. That is how you know it works.

During fine tuning you run two copies of the same model. One is frozen. You give it the full complete set of information up front. It produces the canonical correct answer distribution.

The second copy is trainable. You feed it exactly the same information, one piece at a time, across a simulated conversation. You add a loss term that forces every intermediate hidden state of the student model to match the hidden state of the teacher model at the equivalent position.

That is it. No new architecture. No extra data. Just an extra forward pass per training example.

Results: 32% relative improvement across 6 task families. Zero drop in single turn performance. Zero shot transfer to tasks the model was never trained on in multi turn form.

Most importantly, this does not just make answers correct. It makes them consistent. The same evidence produces the same answer regardless of the order it arrived in.

This is the single largest improvement to multi turn LLM performance ever published. You should be testing this on your fine tuning runs this week.

We finally know why positional encoding stops mode collapse

For two years everyone has known that deep transformers will eventually mode collapse. Run enough layers, and every token will converge to the exact same vector. All output turns into garbage repeated tokens.

No one knew exactly what stopped this from happening earlier. People guessed positional encoding helped. No one had proof.

The mean field transformer paper finally gives the formal explanation.

Self attention is a contraction mapping. Left to itself, every iteration pulls all token vectors closer together. After N steps everything collapses to a single point.

Positional encoding is not just a way to tell tokens apart. It is a constant repulsive force. Every layer adds back exactly enough variance to counteract the contraction from attention.

This is not a design accident. This is the entire reason transformers work at all.

The paper proves that without auxiliary variables, any self attention stack will provably collapse in finite layers. It also proves that fixed prompt prefixes work exactly the same way. That is why system prompts stabilize output. That is why repeating a fixed token at the start of every sequence stops degradation.

You can stop arguing about RoPE vs ALiBi. All positional encodings do the same core job. They inject enough entropy every layer to cancel the collapse.

Multi agent credit assignment was impossible until last week

Multi agent LLM systems work great when they work. When they fail you have no idea why.

You can run 1000 trajectories. 700 succeed, 300 fail. There is no way to tell if the failure was the planner agent, the verification agent, the aggregation step, or just one bad line of output on turn 4.

All existing optimizers treat the entire prompt stack as a black box. They mutate things at random. This works for 3 agent systems. It does not scale past 5.

The new credit assignment method decomposes the error along two separate axes.

First, temporal credit. It inserts tiny state bottlenecks between every round. It compares the state distribution of successful and failed trajectories to find exactly which round the error entered the system.

Second, structural credit. It holds all other agent prompts fixed while modifying one. This isolates the contribution of every individual role.

Once you have this decomposition you do not need to run random search. You can ask the LLM itself to generate a corrected version of only the broken component.

On MMLU collaboration benchmarks this method reduced the number of required trials by 87% while improving final performance by 19%. It also produces human readable explanations for every change.

This is not a minor optimization. This is the thing that will let multi agent systems actually improve themselves instead of just wandering randomly.

Dual path scaling breaks the looped transformer tradeoff

Everyone has been arguing about looped transformers for 12 months.

Looped models reuse the same layer multiple times. You get more compute for the same parameter count. Everyone agrees they are extremely parameter efficient. Everyone also agrees that at equal FLOPs they are always worse than a standard deep transformer.

This was considered a fundamental tradeoff. It is not.

The dual path architecture puts two separate paths inside every single block.

One path is deep. It is a small feed forward network that runs K times in a loop. One path is wide. It is a large feed forward network that runs exactly once.

Every token gets an independent gate that decides how much output to take from each path.

That is the entire change.

At identical FLOP counts, dual path models beat standard transformers on every downstream benchmark. They beat vanilla looped transformers by 7% average. They use 22% fewer parameters than the baseline at matched performance.

The learned gates are perfectly interpretable. Function words and content tokens go wide. Punctuation, numbers and arithmetic tokens go deep. The model automatically routes different classes of token to the type of computation they need.

This is the first new transformer block design that provides an unambiguous win over the standard architecture in three years. It will be in every base model released 12 months from now.

None of this required larger models

Notice a pattern.

None of these papers scaled parameters. None trained on more data. None used more compute.

Every single improvement came from fixing broken assumptions. Every single one came from looking at actual failure modes instead of chasing benchmark numbers.

We are not running out of low hanging fruit. We just spent three years looking up instead of down.

What you should build next

You can implement any of these this week:

  1. Add the CCOPD loss term to your existing fine tuning pipeline. You do not need new data. You just need to generate sharded versions of your existing examples.
  2. Test adding a fixed 3 token prefix to every sequence. You will see reduced mode collapse at long generation lengths. This is free.
  3. Add state bottleneck checks to your multi agent system. Stop doing full prompt random search.
  4. Replace one layer in your transformer with the dual path block. Run a 100M parameter test run. It will outperform the baseline.

None of these require 1000 GPUs. None require access to closed models. All of them work right now.

Closing observations

This is what useful LLM research looks like.

It does not come with demo videos. It does not come with claims of AGI. It comes with a clear failure mode, a testable mechanism, concrete numbers, and something you can go implement.

For the last two years almost all public research was benchmark chasing. That period is ending. People are finally starting to dig into the actual broken parts of these models.

There will be a dozen more papers like these over the next six months. Most of them will also get no attention. Pay attention anyway.