Three New Autoregressive Architectures That Just Dropped And What They Mean

#autoregressive-models #generative-ai #multimodal #world-modeling #diffusion

On June 17, 2026, three separate research groups uploaded papers to arXiv with almost identical timestamps. None cite each other. All arrived at the same conclusion.

Autoregressive modeling is not dead. It was just being implemented wrong.

For two years the field had moved almost entirely to parallel methods: denoising diffusion, flow matching, masked generation. Everyone agreed autoregressive next-token prediction was a dead end for anything except text. All three papers prove that wrong, and they do it with closely related training tricks.

The quiet autoregressive comeback

This was not supposed to happen. As recently as Q1 2026 every major industry roadmap had autoregressive generation marked as legacy. It was slow. It drifted. It produced worse sample quality than diffusion. Everyone was working on exit plans.

All three papers released this week discard that consensus. None of them argue for autoregressive modeling on ideological grounds. They just present hard numbers. Numbers that beat every existing diffusion baseline on speed, convergence, and final quality.

None of them invent a new attention mechanism. None of them propose a new noise schedule. None of them require new hardware. Every improvement comes from how we supervise models during training, not what the model is.

ARM: One tokenizer to rule all multimodal tasks

ARM solves the single problem that killed every prior autoregressive multimodal model. The visual tokenizer was bad. Not slightly bad. Catastrophically bad.

Every previous attempt used off-the-shelf VQGAN or VAE tokens trained only for reconstruction. These tokens encode pixel-level detail very well but carry almost no semantic structure. When you feed them into an autoregressive transformer, the model learns to produce locally plausible garbage that drifts out of alignment with the text prompt.

ARM trains a single tokenizer on three joint objectives:

  1. Standard pixel reconstruction loss
  2. Semantic discriminability loss against an image classifier
  3. Explicit alignment loss against CLIP text and image embeddings
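
One way to picture the joint objective is as a weighted sum of the three terms above. The sketch below is a toy in plain Python, not ARM's code: the function names, the specific loss forms (MSE, softmax cross-entropy, cosine distance) and the equal default weights are all assumptions made for illustration.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def cross_entropy(logits, label):
    """Softmax cross-entropy for one example, computed stably."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(v - m) for v in logits))
    return log_z - logits[label]

def cosine_distance(u, v):
    """1 - cosine similarity; 0 when the vectors point the same way."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return 1.0 - dot / norm

def tokenizer_loss(pixels, recon, class_logits, label, token_emb, clip_emb,
                   w_recon=1.0, w_sem=1.0, w_align=1.0):
    """Hypothetical joint loss: reconstruction + semantics + alignment."""
    l_recon = mse(pixels, recon)                   # objective 1
    l_sem = cross_entropy(class_logits, label)     # objective 2
    l_align = cosine_distance(token_emb, clip_emb) # objective 3
    return w_recon * l_recon + w_sem * l_sem + w_align * l_align
```

The point of summing the terms is that the tokenizer cannot minimize reconstruction alone. Any code it learns also has to stay classifiable and close to the CLIP embedding.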

No one had trained all three together before. The resulting token space works equally well for understanding, generation, and editing. There is no separate encoder. No separate diffusion head. Just a standard 7B Llama-style decoder transformer that eats text tokens and image tokens interchangeably. You can prompt it with text, a full image, a partial image, or text plus a partial image. It just predicts the next token. That is the entire architecture.
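
To make "eats text tokens and image tokens interchangeably" concrete, here is a rough sketch of how one shared id space could be laid out. The vocabulary sizes, the begin/end-of-image markers and the function names are hypothetical; the post does not say how ARM actually packs its sequences.

```python
TEXT_VOCAB = 32000                  # hypothetical text vocabulary size
IMAGE_VOCAB = 8192                  # hypothetical visual codebook size
BOI = TEXT_VOCAB + IMAGE_VOCAB      # hypothetical begin-of-image marker
EOI = BOI + 1                       # hypothetical end-of-image marker

def image_to_ids(codes):
    """Shift visual codebook indices past the text vocabulary."""
    for c in codes:
        if not 0 <= c < IMAGE_VOCAB:
            raise ValueError(f"image code out of range: {c}")
    return [TEXT_VOCAB + c for c in codes]

def build_prompt(text_ids=(), image_codes=None, image_complete=True):
    """Pack text and (optionally partial) image tokens into one sequence.

    A partial image leaves the image span open, so the decoder's next
    predicted token simply continues the image.
    """
    seq = list(text_ids)
    if image_codes is not None:
        seq.append(BOI)
        seq.extend(image_to_ids(image_codes))
        if image_complete:
            seq.append(EOI)
    return seq

# Text plus a partial image: the model is asked to continue the image.
prompt = build_prompt([1, 2], image_codes=[0, 5], image_complete=False)
```

With this layout every prompt type in the list above is the same kind of object, a flat list of integers, which is why a plain decoder can handle all of them.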

The most surprising result comes last. The team ran reinforcement learning against human preference labels for text to image quality. It did not just make images better. It made editing better. It made VQA better. It made OCR better. Preference optimization generalized across every task the model could do. That should not have happened. No one predicted that.

WISE improved from 0.50 to 0.56, and GEdit-Bench-EN jumped from 5.75 to 6.68. That is an enormous jump for a post-training step that left the base training data untouched.
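
For scale, the two score changes quoted above work out to the following relative gains. This is plain arithmetic on the numbers in the post, nothing more.

```python
def relative_gain(before, after):
    """Fractional improvement of `after` over `before`."""
    return (after - before) / before

wise = relative_gain(0.50, 0.56)
gedit = relative_gain(5.75, 6.68)
print(f"WISE: {wise:.1%}, GEdit-Bench-EN: {gedit:.1%}")
# prints "WISE: 12.0%, GEdit-Bench-EN: 16.2%"
```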

The unspoken flaw in single step prediction

ARM fixed the representation problem. But all autoregressive models still had one fundamental flaw that no one had properly addressed. During training you only ever supervise the very next step.

The model never gets any signal about what happens after that. It does not need to learn long term causality. It only needs to learn what one step ahead looks like. This is why video world models drift. This is why generated videos fall apart after 16 frames. This is why lip sync drifts.
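
A toy example shows why a good one-step model can still drift. The scalar system and the 2% model error below are made up for illustration and are not taken from any of the papers.

```python
def one_step_error(a_true, a_model, x):
    """Error when the model is always fed the true previous state."""
    return abs(a_model * x - a_true * x)

def rollout_error(a_true, a_model, x0, steps):
    """Error when the model is fed its own previous predictions."""
    x_true, x_model = x0, x0
    for _ in range(steps):
        x_true = a_true * x_true
        x_model = a_model * x_model
    return abs(x_model - x_true)

# True dynamics keep the state constant; the model overshoots by 2%.
single = one_step_error(1.0, 1.02, 1.0)
after_16 = rollout_error(1.0, 1.02, 1.0, steps=16)
```

The one-step error is what training sees. The rollout error is what you get at generation time, and it compounds with every step the model takes on its own output.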

Everyone knew this. No one had a good fix. Until this week.

Next Forcing: Multi chunk supervision for world models

Next Forcing attacks exactly this flaw. The authors do not change the base model architecture at all.

They add lightweight linear auxiliary heads, attached at every third layer of the transformer. The heads predict not only the next chunk but also the chunk after that, and the one after that. Three heads total, predicting 1, 2, and 3 chunks ahead.

All heads run during training. All contribute gradient. The base model now gets supervision signal for three different time horizons at every training step. It cannot just learn a good one step approximation. It has to learn actual dynamics that hold across time.
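
In code, the idea looks roughly like the sketch below. It is a simplified toy, not the Next Forcing implementation: it assumes one hidden vector per position (the post describes heads at different layers), MSE as the per-chunk loss, and equal weight per horizon. All names are hypothetical.

```python
def linear(weights, h):
    """Apply a weight matrix (list of rows) to a hidden vector."""
    return [sum(w * x for w, x in zip(row, h)) for row in weights]

def mse(pred, target):
    """Mean squared error between two equal-length vectors."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)

def multi_horizon_loss(hidden, chunks, heads):
    """Sum of per-horizon losses.

    hidden[t]  : hidden state after the model has seen chunks[0..t]
    heads[k-1] : linear head that predicts the chunk k steps ahead
    Requires len(chunks) > len(heads) so every horizon has a target.
    """
    total = 0.0
    for k, head in enumerate(heads, start=1):
        # Only positions that still have a chunk k steps ahead.
        positions = range(len(chunks) - k)
        losses = [mse(linear(head, hidden[t]), chunks[t + k])
                  for t in positions]
        total += sum(losses) / len(losses)
    return total

# One-dimensional toy: hidden[t] already equals the next chunk.
chunks = [[0.0], [1.0], [2.0], [3.0]]
hidden = [[1.0], [2.0], [3.0], [4.0]]
identity = [[1.0]]
loss_1 = multi_horizon_loss(hidden, chunks, [identity])
loss_2 = multi_horizon_loss(hidden, chunks, [identity, identity])
```

In the toy, a representation that is perfect one step ahead scores zero with a single head but is penalized as soon as a second horizon is added. That extra penalty is the training signal single-step supervision never provides.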

At 50 FPS on RoboTwin, Next Forcing delivers a 93.1% relative improvement over LingBot-VA at 5k training steps. Convergence is 2.3x faster. At inference you can throw the auxiliary heads away entirely. All the improvement stays in the base model. You pay zero overhead at runtime.

If you keep the heads attached at inference you get an additional 2x speedup. You can run the next chunk prediction in parallel while you are still outputting the current chunk.
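
One plausible reading of that 2x figure is that each forward pass yields two chunks instead of one. The loop below counts forward passes under that assumption. The `step` callback and the choice to accept the auxiliary head's draft without checking it are both hypothetical, since the post does not describe the actual decoding scheme.

```python
def generate(step, first_chunk, n_chunks, keep_heads):
    """Generate n_chunks and count forward passes.

    step(history) stands in for one forward pass and returns
    (next_chunk, draft_of_chunk_after_next).
    """
    out = [first_chunk]
    passes = 0
    while len(out) < n_chunks:
        nxt, draft = step(out)
        passes += 1
        out.append(nxt)
        if keep_heads and len(out) < n_chunks:
            # Use the auxiliary head's prediction instead of another pass.
            out.append(draft)
    return out, passes

# Toy dynamics: each chunk is the previous chunk plus one.
toy_step = lambda hist: (hist[-1] + 1, hist[-1] + 2)
base_out, base_passes = generate(toy_step, 0, 9, keep_heads=False)
fast_out, fast_passes = generate(toy_step, 0, 9, keep_heads=True)
```

In this toy both runs produce the same nine chunks, with eight passes in the first case and four in the second. On a real model the draft would be less accurate than a full pass, so the speedup trades against quality.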

This is not a world model trick. This is a general training trick that works on any sequence model.

Lip Forcing: The same trick applied to real time lip sync

Lip Forcing is not just a lip sync paper. It is the exact same multi horizon supervision trick, applied to diffusion.

Prior state of the art lip sync models used full bidirectional attention over the whole sequence. They required 50 denoising steps. They ran at 0.8 FPS. They could not be used for live streaming.

Lip Forcing distills the bidirectional teacher into an autoregressive student. During distillation the authors train the student to predict not just the current frame, but the next two frames as well. It is the same auxiliary head structure that Next Forcing uses.

The result is 2 total denoising steps. No classifier free guidance at inference. The 1.3B student runs at 31 FPS, 17.6x faster than an equivalent bidirectional model. Time to first frame is under 1ms. That is good enough for live video calls. No one has ever hit that before.
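
A quick sanity check on those figures, using only the numbers quoted in this post. Note that the 17.6x comparison is against "an equivalent bidirectional model", which is not necessarily the 0.8 FPS prior system mentioned earlier.

```python
student_fps = 31.0
speedup_vs_bidirectional = 17.6

# Time available per frame at the student's frame rate.
frame_budget_ms = 1000.0 / student_fps                 # about 32.3 ms

# Frame rate the comparison model would need for the ratio to hold.
implied_baseline_fps = student_fps / speedup_vs_bidirectional  # about 1.76 FPS

# Reduction in denoising steps relative to the 50-step prior models.
step_reduction = 50 / 2                                # 25x fewer steps
```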

The 14B student, the largest diffusion model ever reported for video to video lip sync, runs 39.8x faster than its teacher at comparable visual quality.

The shared design pattern across all three papers

None of the papers note this, but taken together they converge on the same three-part recipe:

  1. A standard unmodified autoregressive decoder base
  2. Multi horizon auxiliary prediction heads used only during training
  3. Preference or trajectory alignment applied after base training

That is it. That is the entire secret.

We have spent three years optimizing every detail of the transformer forward pass. We have argued endlessly about diffusion vs flow vs autoregressive. It turns out none of that mattered. We were just training all of them wrong.

The biggest improvement to generative modeling in three years is not a new architecture. It is a better loss function.

What breaks now

Every baseline published before June 2026 now needs to be rerun. If these results hold up, many will be beaten by 30-100% just by adding multi chunk supervision. No architecture changes required.

Inference latency numbers everyone considered fundamental limits are gone. A diffusion lip sync model running above 30 FPS was considered impossible last month. The 1.3B Lip Forcing student does it now, and the 14B student runs 39.8x faster than its teacher.

RLHF is no longer just for chat models. It works for generation. It works for editing. It generalizes across tasks. Every foundation model will run preference tuning after base training from this point forward.

No one will be training single step autoregressive models 12 months from now.

Open questions no one is asking yet

How far ahead can you push this? Can you add 10 prediction heads? 100? The papers tested 3 heads and got linear gains. No one has tested where the curve flattens.

Does this work for text? Multi-token prediction heads have in fact been tried in LLM pretraining, for example in Meta's 2024 multi-token prediction work and in DeepSeek-V3's training objective. Most LLMs today still train on single next-token prediction only. We might be leaving convergence speed on the table there too.
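
On the data side, multi-token prediction for text only needs shifted copies of the token sequence, one per horizon. A minimal sketch, with a hypothetical helper name and the padding value chosen to match the default `ignore_index` of PyTorch's cross-entropy loss:

```python
def shifted_targets(tokens, horizons=(1, 2, 3), pad=-100):
    """Build one target sequence per prediction horizon.

    For horizon k, target[t] is the token k positions after t.
    Positions with no token k ahead get `pad` so a loss can skip them.
    """
    n = len(tokens)
    return {
        k: [tokens[t + k] if t + k < n else pad for t in range(n)]
        for k in horizons
    }

targets = shifted_targets([5, 6, 7, 8])
```

Horizon 1 is ordinary next-token prediction. Horizons 2 and 3 are the extra supervision, each fed to its own auxiliary head during training.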

Why does RL on one task improve every other task? The ARM paper notes this result but does not explain it. No one has an explanation. We are observing an emergent property of large models that we do not have a theory for.

Implementation notes

All three papers will have reference code released before the end of June. You can implement multi chunk supervision on any existing autoregressive model this weekend. It requires changing approximately 12 lines of code in your training loop. You do not need more compute. You do not need more data. You just need to add the extra heads and sum the loss.

That is the most ridiculous part of this entire batch of work. The biggest improvement to generative modeling in years is 12 lines of code that no one bothered to write.

Closing

This is not an incremental improvement. This is a reset.

For three years we have been running in circles arguing about architectural paradigms. We have produced hundreds of papers proposing minor variations on the same transformer block. We have argued for thousands of hours about which generation framework is inherently superior.

None of that mattered. We were all just training our models wrong.

That is the thing about this field. Very often the answer is not some brilliant new invention. It is something obvious that everyone just overlooked for three years.


References

  1. ARM: An AutoRegressive Large Multimodal Model with Unified Discrete Representations http://arxiv.org/abs/2606.11188v1
  2. Next Forcing: Causal World Modeling with Multi-Chunk Prediction http://arxiv.org/abs/2606.11187v1
  3. Lip Forcing: Few-Step Autoregressive Diffusion for Real-time Lip Synchronization http://arxiv.org/abs/2606.11180v1