The Week Video Generation Became World Models

#video-diffusion #world-models #4d-generation #transformers #causality

Between May 19 and May 21, 2026, seven related papers landed on arXiv. None of them got the viral demo treatment. None announced a new flagship model that generates 30-second cat videos.

Every single one solved a fundamental, unglamorous, blocking problem that has stopped video diffusion from evolving into actual world models.

Taken together, they end the first phase of generative video. We are no longer building systems that produce static clips. We are now building systems that simulate running worlds.

The static anchor bug that killed all long video

For two years everyone noticed the same failure mode in autoregressive video generation. You could generate 10 good seconds. After that, nothing moved. The camera would not pan. Objects would not leave the frame. The whole scene would freeze into a barely animating diorama, even if you explicitly asked for motion.

Nobody had correctly diagnosed why until AdaState.

It turns out every streaming video model was hardwired to treat the first generated frame as a permanent, privileged anchor in the attention cache. That first frame was the cleanest, lowest-noise entry in the sequence, so attention heads learned to default to it. Over time, all new generation regressed toward that first frame, suppressing change. Motion was actively penalized.

AdaState replaces this frozen anchor with a single unrendered latent state, updated at every generation step. The model never references the original first frame again. It carries forward only its own evolving summary of the scene. There is no absolute time index: every step uses identical positional structure.
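To make the contrast concrete, here is a toy sketch of a rollout whose context is a rolling state plus a short window of recent latents. Everything in it is an assumption for illustration: the `rollout` function, the blend rule for the state, the `mix` and `window` parameters, and the noise term standing in for a denoiser are not AdaState's actual design. Setting `update_state=False` reproduces the pinned first-frame anchor described above.

```python
import numpy as np

def attend(query, context):
    """Single-head softmax attention of one query over the context rows."""
    scores = context @ query / np.sqrt(query.shape[0])
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()
    return weights @ context

def rollout(first_latent, num_steps, window=4, mix=0.5, update_state=True, seed=0):
    """Toy autoregressive rollout over a context of [state] + recent latents.

    update_state=False pins slot 0 to the first frame (the static anchor).
    update_state=True overwrites slot 0 every step, so the first frame is
    never re-read. The blend rule is a hypothetical placeholder.
    """
    rng = np.random.default_rng(seed)
    state = first_latent.copy()
    recent = [first_latent.copy()]
    for _ in range(num_steps):
        # Slot 0 is always the state; the remaining slots are the recent window.
        context = np.stack([state] + recent[-window:])
        # Stand-in for the denoiser: attend over the context, then add novelty.
        novelty = 0.1 * rng.standard_normal(first_latent.shape)
        new_latent = attend(recent[-1], context) + novelty
        recent.append(new_latent)
        if update_state:
            # Hypothetical update: blend the old state with the newest latent.
            state = (1.0 - mix) * state + mix * new_latent
    return state, recent
```

With `update_state=False` the returned state is still the first latent after any number of steps. With `update_state=True` it drifts with the scene, which is the behaviour the paragraph above describes.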

This is a one-line architectural change with no extra parameters. It removes the hard upper limit on video length that every existing model had. You will not see this mentioned in demo tweets. It is the most important improvement to video generation in 18 months.

We can finally measure causality, not just clip quality

All video generation benchmarks until now measured how good a clip looked. None measured if it made sense.

YoCausal fixes this. The authors did something extremely obvious and extremely clever: they took real-world video, reversed it, and tested whether models could tell the difference. This is not a trick. This is the same violation-of-expectation test used to measure causal reasoning in human infants.
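The reversal test can be sketched in a few lines. This is a minimal illustration, not YoCausal's protocol or its metric. `forward_preference_rate` and `toy_score` are hypothetical names, and the scorer is a deliberately trivial stand-in for whatever plausibility score a real model would supply.

```python
import numpy as np

def forward_preference_rate(clips, score):
    """Fraction of clips that the scorer rates above their time-reversed copy.

    clips: list of arrays shaped (time, height, width).
    score: callable mapping a clip to a float (higher means more plausible).
    """
    wins = 0
    for clip in clips:
        reversed_clip = clip[::-1]  # flip the time axis only
        if score(clip) > score(reversed_clip):
            wins += 1
    return wins / len(clips)

def toy_score(clip):
    """Stand-in scorer (assumption): prefers clips that brighten over time."""
    per_frame_brightness = clip.mean(axis=(1, 2))
    return float(np.diff(per_frame_brightness).sum())
```

A scorer can do well on this check by reading temporal order alone. Separating that from real causal violations needs paired clips that differ only in causal structure, and this sketch does not attempt that.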

They tested 13 state-of-the-art models. The results are brutal. Every model can reliably detect the arrow of time. None of them can distinguish causal violations from trivial temporal order. A model will correctly tell you that smoke does not flow into a fire. It will happily accept that a glass jumps off the floor and reassembles itself on the table.

Most importantly: there is zero correlation between standard human preference scores and causal performance. The model that wins all public leaderboards finished dead last on causality.

This is the new benchmark. Every paper from this point on will have to report a CCI score, YoCausal's causality metric. We are no longer optimizing for pretty. We are optimizing for correct.

The attention scaling wall fell

Generating long, high-resolution video died on the quadratic cost of self-attention. Every proposed sparse attention scheme degraded quality too badly to use at useful sparsity ratios.

Veda changed this. The authors made one simple observation: generation quality does not depend on how many attention tokens you keep. It depends entirely on which ones you keep. All prior sparse attention methods were picking the wrong tokens.
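A small NumPy sketch makes the "which ones, not how many" point testable. It builds two tile masks with the same budget, one chosen by attention mass and one chosen at random, and runs block-sparse attention under each. This is not Veda's method: the top-k mask here is an oracle computed from full attention, which a real system could not afford, and all function names are hypothetical.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def tile_mask_topk(q, k, tile, keep):
    """Oracle mask: for each query tile, keep the `keep` key tiles that hold
    the most attention mass. Assumes the sequence length divides by `tile`."""
    num_tiles = q.shape[0] // tile
    probs = softmax(q @ k.T / np.sqrt(q.shape[1]))
    mass = probs.reshape(num_tiles, tile, num_tiles, tile).sum(axis=(1, 3))
    top = np.argsort(-mass, axis=1)[:, :keep]
    mask = np.zeros((num_tiles, num_tiles), dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    return mask

def tile_mask_random(num_tiles, keep, rng):
    """Baseline mask with the same budget, but tiles picked at random."""
    mask = np.zeros((num_tiles, num_tiles), dtype=bool)
    for row in range(num_tiles):
        mask[row, rng.choice(num_tiles, size=keep, replace=False)] = True
    return mask

def sparse_attention(q, k, v, mask, tile):
    """Attention restricted to the key tiles that `mask` keeps."""
    full = np.repeat(np.repeat(mask, tile, axis=0), tile, axis=1)
    scores = np.where(full, q @ k.T / np.sqrt(q.shape[1]), -np.inf)
    return softmax(scores) @ v
```

Comparing each sparse output against dense attention at the same `keep` budget shows how much of the error comes from the selection rule as opposed to the sparsity level.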

Veda trains tile selection masks by reconstructing full attention patterns, not by heuristic. On Waver-T2V-12B it delivers a 5.1x end-to-end speedup for 10-second 720p video with zero measurable quality loss. Attention overhead drops from 92% of runtime to 50%.
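The speedup and the runtime shares can be related with Amdahl-style arithmetic. The sketch below assumes that only attention gets faster and that all other work costs the same before and after. That assumption is mine and is not stated in the paper. Both helper names are hypothetical.

```python
def implied_speedup(attn_before, attn_after):
    """End-to-end speedup implied by a change in attention's runtime share,
    assuming non-attention time is unchanged."""
    other = 1.0 - attn_before               # non-attention share of the old runtime
    new_total = other / (1.0 - attn_after)  # new runtime as a fraction of the old
    return 1.0 / new_total

def implied_attention_share(attn_before, speedup):
    """Attention's share of the new runtime implied by an end-to-end speedup,
    under the same assumption."""
    other = 1.0 - attn_before
    new_total = 1.0 / speedup
    return 1.0 - other / new_total
```

Under that assumption, a drop from 92% to 50% corresponds to roughly 6.25x, and a 5.1x speedup corresponds to an attention share of roughly 59%. The quoted figures are therefore best read as approximate, or as reflecting costs (such as mask selection) that this simple model does not separate out.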

Gains go up as sequence length increases. This is not a 2x improvement. This removes the last practical barrier to running 1000-frame video generation on consumer hardware.

4D dynamics stopped being a physics problem

For years everyone tried to build 4D generation by bolting classical physics engines on top of generative models. It never worked. It never generalized.

NeuROK and PhyGenHOI abandoned this approach entirely.

NeuROK learns a full latent kinematic space for arbitrary objects. It does not know Newton's laws. It does not have a hard coded mass or friction parameter. It just learns every plausible way an object can deform and move, then runs dynamics entirely inside that latent space.
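The encode, step, decode pattern described here can be sketched with linear maps standing in for learned networks. Nothing below comes from NeuROK itself: the random encoder, the pseudo-inverse decoder, and the orthogonal transition are all assumptions chosen only to keep the toy stable and runnable. Note that no mass, friction, or force term appears anywhere.

```python
import numpy as np

def make_toy_model(state_dim=6, latent_dim=3, seed=0):
    """Build stand-ins for a learned encoder, decoder, and latent transition."""
    rng = np.random.default_rng(seed)
    encoder = rng.standard_normal((latent_dim, state_dim)) / np.sqrt(state_dim)
    decoder = np.linalg.pinv(encoder)  # maps latents back to object state
    # Orthogonal transition, so latent norms neither blow up nor decay.
    transition, _ = np.linalg.qr(rng.standard_normal((latent_dim, latent_dim)))
    return encoder, decoder, transition

def rollout_in_latent(x0, steps, encoder, decoder, transition):
    """Encode once, advance only in latent space, decode only to observe."""
    z = encoder @ x0
    frames = []
    for _ in range(steps):
        z = transition @ z          # the dynamics step never touches x directly
        frames.append(decoder @ z)  # decoding is a read-out, not part of the loop
    return np.stack(frames)
```

In a real system the three maps would be trained networks, and the transition would encode whatever motions the data made plausible.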

PhyGenHOI extends this to human-object interaction. It couples a motion diffusion model with a differentiable material point simulator, and only enforces physical consistency at the exact moment of contact. Everything before and after is generated. It produces correct momentum transfer, correct deformation, and correct object behaviour, without forcing the entire generation pipeline to obey a handwritten physics model.

We are not simulating physics anymore. We are emulating it. This works much better.

You can build a working world model this weekend

Six months ago, building an interactive video world model required a 20-person research team and 1000 A100 hours.

Today you can run minWM.

minWM is a full-stack open-source pipeline that takes any existing public video diffusion model and converts it into a real-time, camera-controllable autoregressive world model. It implements every required step: fine-tuning for camera control, causal forcing, consistency distillation, and few-step rollout.

It works out of the box on Wan2.1 and HY1.5. It comes with runnable scripts, precomputed checkpoints, and full ablation data for every hyperparameter.

This is not a research prototype. This is a standardised build system for world models. This is the equivalent of Stable Diffusion v1 coming out, except instead of generating images it generates running interactive environments.

The quiet threshold

None of these papers are perfect. All have limitations. None will give you a holodeck next Tuesday.

But look at the list of problems that were open 30 days ago:

  • Long video would always degrade into static. Fixed.
  • We had no objective measure of causal understanding. Fixed.
  • Attention scaling was hitting a hard architectural wall. Fixed.
  • General 4D object dynamics did not exist. Fixed.
  • There was no reusable pipeline for interactive world models. Fixed.

This is not incremental progress. This is the point where a field crosses a threshold. For three years we have been building better and better video generators. This week we stopped. We started building world models.

Nobody put out a press release. Nobody made a viral demo. This is how most important progress happens.