Video-LLMs don't have perception problems. They have binding problems.

Last week every major Video-LLM benchmark got broken again. GPT-4o Video hit 89% on MVBench. Gemini Advanced cleared 92% on Video-MME.

And on a 10 second clip of a single red square moving left across a white background? 27% accuracy.

That is barely above chance. With four possible answers, random guessing scores 25%. The model might as well be rolling a four-sided die.

That result comes from the first paper we will look at, and it is not an outlier. Over the last 72 hours five independent papers dropped on arXiv, all probing spatio-temporal reasoning in multimodal models. All arrived at almost exactly the same conclusion, from completely different angles.

Directional motion blindness

The paper Which Way Did It Move? documents what may be the most embarrassing and revealing failure of Video-LLMs published to date.

The authors tested 12 leading open and closed models on clips containing exactly one moving object, with no distractors, no occlusion, and constant speed. Every model except one scored between 22% and 31%.

This was not a failure of perception. The authors ran linear probes at every stage of the model pipeline. Motion direction was linearly separable with 97% accuracy from the vision encoder output. It was still present at 92% accuracy after the projector. It remained detectable at 89% accuracy in the LLM's own hidden states immediately before token generation.
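
To make the probing idea concrete, here is a minimal sketch of a linear probe. It is not the authors' code. The features are synthetic stand-ins for hidden states (an assumption for illustration), and names such as make_features and train_probe are hypothetical. A real probe would be trained on activations captured from the vision encoder, the projector, or the LLM.

```python
import math
import random

def make_features(n, dim=16, seed=0):
    """Synthetic stand-in for hidden states: the label is encoded along one
    fixed axis plus Gaussian noise. Real probes use captured activations."""
    rng = random.Random(seed)
    axis = [rng.gauss(0, 1) for _ in range(dim)]
    xs, ys = [], []
    for _ in range(n):
        y = rng.randint(0, 1)                 # 0 = left, 1 = right
        sign = 1.0 if y == 1 else -1.0
        xs.append([sign * a + rng.gauss(0, 1) for a in axis])
        ys.append(y)
    return xs, ys

def train_probe(xs, ys, epochs=20, lr=0.05):
    """Fit a logistic-regression probe with plain stochastic gradient descent."""
    w = [0.0] * len(xs[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(xs, ys):
            z = sum(wi * xi for wi, xi in zip(w, x)) + b
            z = max(-30.0, min(30.0, z))      # clamp to keep exp() well-behaved
            p = 1.0 / (1.0 + math.exp(-z))
            grad = p - y                      # gradient of the log loss w.r.t. z
            w = [wi - lr * grad * xi for wi, xi in zip(w, x)]
            b -= lr * grad
    return w, b

def probe_accuracy(w, b, xs, ys):
    """Fraction of examples where the sign of the linear score matches the label."""
    correct = 0
    for x, y in zip(xs, ys):
        z = sum(wi * xi for wi, xi in zip(w, x)) + b
        correct += int((z > 0) == (y == 1))
    return correct / len(xs)

if __name__ == "__main__":
    xs, ys = make_features(400)
    w, b = train_probe(xs[:300], ys[:300])    # train on 300, hold out 100
    print(f"held-out probe accuracy: {probe_accuracy(w, b, xs[300:], ys[300:]):.2f}")
```

If a probe this simple reads the direction off a layer's activations, the information is present at that layer, whatever the model ends up saying.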

The signal never went missing. It just never got connected to the words "left" or "right".

The authors call this a direction binding gap. The model can see motion. It knows what "left" means. It cannot bind the two together.

Instruction tuning on motion examples closes this gap on synthetic clips, but the gain disappears as soon as you add a textured background. The DeltaDirect objective introduced in the paper fixes this, raising real-world motion accuracy by 21.9 points without any fine-tuning on real video. The fix does not touch the vision encoder. It does not touch the LLM. It only adds a tiny auxiliary loss on the projector layer.

That is the pattern. The fix is never in the big parts everyone scales. It is always in the boring interface layers no one pays attention to.

Persistent worlds and memory retrieval

WorldKV addresses the same failure, extended across time.

Autoregressive world models can now generate playable 30fps environments. But if you turn the camera 180 degrees and turn back, the chair that was there will be gone. The cup on the table will have changed colour.

Everyone assumed this happened because the model forgot. This was wrong.

The full KV cache still contains every token from the earlier viewpoint. The model has not forgotten anything. Under sliding-window inference, those tokens are simply evicted from the active attention window, and the model never goes looking for them.

WorldKV does not train a new memory system. It does not add parameters. It adds three hundred lines of code that track the camera pose. When you turn back around, that code pulls the old KV chunks out of idle memory and puts them back into the attention window.

That is it.

This one change matches the consistency of attending over the full KV cache, at twice the throughput. Objects stay the same when you leave and return. Nothing was retrained. Nothing was scaled. The memory was already there. It just was not being retrieved.
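
The retrieval idea can be sketched in a few lines. This is a toy illustration, not WorldKV's implementation: the class name PoseKeyedKVWindow, the yaw-only pose, the 15 degree tolerance, and the oldest-first eviction rule are all assumptions made for the example, and the "chunks" are plain strings rather than key/value tensors.

```python
class PoseKeyedKVWindow:
    """Toy sliding attention window whose evicted chunks can be recalled by pose."""

    def __init__(self, window_size, yaw_tolerance_deg=15.0):
        self.window_size = window_size
        self.yaw_tolerance_deg = yaw_tolerance_deg
        self.active = []  # (yaw_deg, chunk) pairs the model can attend to
        self.idle = []    # evicted pairs: still in memory, but not attended to

    @staticmethod
    def yaw_distance(a, b):
        """Smallest absolute angle between two headings, in degrees."""
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    def step(self, yaw_deg, new_chunk):
        """Add the current frame's chunk, recalling idle chunks seen near this pose."""
        recalled, still_idle = [], []
        for entry in self.idle:
            near = self.yaw_distance(entry[0], yaw_deg) <= self.yaw_tolerance_deg
            (recalled if near else still_idle).append(entry)
        self.idle = still_idle
        # Recalled chunks go just before the newest one, so the oldest-first
        # eviction below removes unrelated chunks first.
        self.active = self.active + recalled + [(yaw_deg, new_chunk)]
        while len(self.active) > self.window_size:
            self.idle.append(self.active.pop(0))

    def visible(self):
        """Chunks currently inside the attention window, oldest first."""
        return [chunk for _, chunk in self.active]

if __name__ == "__main__":
    window = PoseKeyedKVWindow(window_size=3)
    for yaw, chunk in [(0, "chair"), (60, "a"), (120, "b"), (180, "c")]:
        window.step(yaw, chunk)
    print(window.visible())   # the chair has been evicted to idle memory
    window.step(5, "d")       # camera turns back toward yaw 0
    print(window.visible())   # the chair chunk is back in the window
```

Nothing here is learned. The only decision is when to look in idle memory, and the camera pose answers that.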

All existing benchmarks are lying to you

We did not find any of these failures for three years because every benchmark was built to not find them.

VGenST-Bench demonstrates this conclusively. All existing spatio-temporal benchmarks use passively collected video. Every single one has severe dataset bias. Every single one can be beaten by pattern matching and semantic memorization, no actual reasoning required.

The authors instead built a pipeline that actively synthesizes test videos to order. They can generate exactly the same scene, same motion, same lighting, rotated 5 degrees. They can remove every possible confounding variable.
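
As a rough illustration of what "synthesized to order" buys you, the sketch below generates clips of a single square that differ only in motion direction. It is not the VGenST-Bench pipeline. The function names, frame size, and binary-mask frame format are assumptions chosen to keep the example small.

```python
# Image coordinates: x grows to the right, y grows downward.
DIRECTIONS = {"left": (-1, 0), "right": (1, 0), "up": (0, -1), "down": (0, 1)}

def make_clip(direction, size=32, n_frames=10, square=4, speed=2):
    """Return n_frames binary frames (1 = object, 0 = background) in which
    a square moves at constant speed in the given direction."""
    dx, dy = DIRECTIONS[direction]
    half_travel = speed * (n_frames - 1) // 2
    center = (size - square) // 2
    # Start offset from the center so the path is centered in the frame.
    x0 = center - dx * half_travel
    y0 = center - dy * half_travel
    frames = []
    for t in range(n_frames):
        left, top = x0 + dx * speed * t, y0 + dy * speed * t
        frame = [[0] * size for _ in range(size)]
        for y in range(top, top + square):
            for x in range(left, left + square):
                frame[y][x] = 1
        frames.append(frame)
    return frames

def centroid(frame):
    """Mean (x, y) position of the object pixels in one frame."""
    pts = [(x, y) for y, row in enumerate(frame) for x, v in enumerate(row) if v]
    n = len(pts)
    return (sum(x for x, _ in pts) / n, sum(y for _, y in pts) / n)

def make_test_set():
    """One clip per direction; everything except the direction is held fixed."""
    return [(make_clip(d), d) for d in DIRECTIONS]

if __name__ == "__main__":
    for clip, label in make_test_set():
        print(label, centroid(clip[0]), "->", centroid(clip[-1]))
```

Because every clip shares the same object, speed, and background, a model that gets one direction right and another wrong cannot be excused by scene content.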

When tested on this benchmark, every state-of-the-art model drops 30-40 percentage points relative to its published benchmark scores.

SpaceDG extends this to real-world conditions. It adds physically accurate motion blur, compression artifacts, low light, and lens distortion. Every model tested collapsed. Even GPT-4o fell from 81% on clean inputs to 38% under moderate compression that is invisible to human observers.

No one had ever tested this before. Everyone ran benchmarks on perfectly rendered clean frames. No one checked what happens when you feed the model the kind of video that actually comes out of a real camera.
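
A degraded-input check does not need much machinery. The sketch below is a minimal harness under loud assumptions: low_light and quantize are crude stand-ins for the degradations described above (quantize is not a real codec), toy_predict is a hypothetical placeholder for a model call, and frames are small grayscale grids. None of it comes from SpaceDG.

```python
def low_light(frame, gain=0.3):
    """Darken a grayscale frame by scaling every pixel."""
    return [[int(v * gain) for v in row] for row in frame]

def quantize(frame, levels=8):
    """Coarsen intensities to a few steps: a crude stand-in for compression."""
    step = 256 // levels
    return [[(v // step) * step for v in row] for row in frame]

def accuracy(predict, examples, degrade=None):
    """Fraction of (frame, label) examples predicted correctly,
    optionally after applying a degradation to each frame."""
    correct = 0
    for frame, label in examples:
        if degrade is not None:
            frame = degrade(frame)
        correct += int(predict(frame) == label)
    return correct / len(examples)

def robustness_gap(predict, examples, degrade):
    """Accuracy lost, in percentage points, when inputs are degraded."""
    clean = accuracy(predict, examples)
    degraded = accuracy(predict, examples, degrade)
    return 100.0 * (clean - degraded)

def toy_predict(frame):
    """Hypothetical stand-in for a model: labels a frame by mean brightness."""
    pixels = [v for row in frame for v in row]
    return "day" if sum(pixels) / len(pixels) > 128 else "night"

if __name__ == "__main__":
    examples = [
        ([[200] * 4 for _ in range(4)], "day"),
        ([[40] * 4 for _ in range(4)], "night"),
    ]
    for name, degrade in [("low_light", low_light), ("quantize", quantize)]:
        print(name, robustness_gap(toy_predict, examples, degrade))
```

Measured this way, the GPT-4o result quoted above is a gap of 81 - 38 = 43 percentage points.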

Geometry is not an afterthought

GeoWeaver attacks the root of all these binding failures.

Until now every approach to geometric grounding worked the same way: run the vision encoder, run a separate geometry encoder, and concatenate both signals at the input to the LLM. This gives small, inconsistent improvements.

This paper shows this approach is backwards.

Geometry cannot be fused at reasoning time. It has to be baked into each individual visual token before it ever reaches the LLM. Different tokens need different geometric information. A token for a wall needs surface normal data. A token for a ball on the floor needs depth and velocity. A token for the sky needs almost no geometry at all.

GeoWeaver runs a single geometric grounding step before the projector. Each visual token retrieves only the geometric evidence relevant to its own spatial role. This change gives consistent 12-18% improvements across every spatial reasoning benchmark, with zero loss on general multimodal tasks.
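
One way to picture per-token grounding is a set of gates, one per kind of geometric evidence, applied separately for each visual token. The sketch below is only an illustration of that idea and makes no claim about GeoWeaver's actual architecture. The channel names, the sigmoid gating, and the assumption that gate logits are predicted per token are all hypothetical.

```python
import math

GEOMETRY_CHANNELS = ["surface_normal", "depth", "velocity"]

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def ground_token(token, gate_logits, evidence):
    """Append gated geometric evidence to a single visual token.

    token:       list of floats, the visual feature for one patch
    gate_logits: dict channel -> float, how relevant each channel is to this
                 token (assumed to be predicted per token by a small network)
    evidence:    dict channel -> list of floats, geometry at the token's location
    """
    gates = {name: sigmoid(gate_logits[name]) for name in GEOMETRY_CHANNELS}
    grounded = list(token)
    for name in GEOMETRY_CHANNELS:
        # A gate near 0 suppresses the channel; a gate near 1 passes it through.
        grounded.extend(gates[name] * v for v in evidence[name])
    return grounded, gates

if __name__ == "__main__":
    evidence = {"surface_normal": [0.0, 0.0, 1.0], "depth": [2.5], "velocity": [0.4, 0.0]}
    wall_logits = {"surface_normal": 6.0, "depth": -6.0, "velocity": -6.0}
    sky_logits = {name: -6.0 for name in GEOMETRY_CHANNELS}
    print(ground_token([0.1, 0.2], wall_logits, evidence)[0])  # keeps the normal
    print(ground_token([0.3, 0.3], sky_logits, evidence)[0])   # keeps almost nothing
```

Independent gates, rather than a softmax over channels, let a sky token take almost no geometry at all instead of being forced to pick something.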

Most importantly, this change also closes half the motion direction gap measured in Which Way Did It Move? Binding works much better when the tokens actually carry the right information to begin with.

The unifying pattern

Read all five papers in sequence and the pattern is unmistakable.

None of these are capacity failures.

None of these will be fixed by training a 2 trillion parameter model.

None of these will be fixed by better training data.

At every level, for every spatio-temporal task, the Video-LLM stack already contains the correct information. The vision encoder works far better than anyone gave it credit for. The LLM reasoning works.

Everything breaks at the boundary. Signals are not routed. Memory is not retrieved. Tokens are not grounded. Features are not bound.

For three years everyone has been scaling the two ends of the pipeline. We made the vision encoder bigger. We made the LLM bigger. We left the 0.1% of parameters that connect them completely untouched. That is where every failure lives.

This is good news. We do not need to wait three more years for the next generation of models. Most of the performance gains we have been chasing are already available right now, for free, if we just fix the wiring.

What this means for deployments

If you are building systems with Video-LLMs today, you can stop waiting for better base models.

Stop fine-tuning the LLM. Stop training bigger vision encoders.

Start measuring binding failures. Start testing with degraded inputs. Start checking if the signals you need already exist inside the model, before you start adding parameters.

The hard part of multimodal reasoning was never seeing or thinking. It was connecting the two.

Video-LLMs don't have perception problems. They have binding problems.
