Five new papers that fix the actual broken parts of multimodal LLMs

#multimodal-llm #visual-reasoning #knowledge-editing #llm-benchmarks #icml-2026

Every production team running multimodal LLMs right now is hitting the same walls.

You can drop a 1T-parameter MoE model on a problem. It will correctly identify a cat in a photo. It will fail to count how many bolts are on the left flange of an engineering diagram. It will confidently lie about the value of the third bar on a bar chart. It will never forget the wrong fact you accidentally taught it last week.

For two years every major release only scaled parameters and input resolution. None fixed the underlying failures. This week five papers landed ahead of ICML 2026 that actually do. None of them require training a new base model. Almost all work as drop-in improvements on every major open and closed model available today.

The unspoken production bottlenecks

All five papers converge on the same observation: almost none of the remaining multimodal failures come from insufficient model capacity. They come from bad interfaces between the vision and language components, bad training supervision, bad evaluation, and bad assumptions about how reasoning should work.

Right now every multimodal LLM does exactly one thing: it encodes an entire image into a single static sequence of tokens, then passes that sequence to a text LLM to reason over. This was a great shortcut to get working systems. It is also the root cause of every remaining hard failure.

Every work below attacks this stack at a different layer.

Stop thinking with text: ETCHR edits images while reasoning

Chain of thought works for text. It does not work for vision.

When a human is asked "is this chair oriented to the left or right?" they do not describe the chair out loud to themselves. They rotate the mental image. When asked to count screws they zoom in on each region one at a time. They edit their perception of the image as they reason.

ETCHR formalizes this. Instead of forcing the LLM to describe all reasoning steps in text, it gives the model a dedicated, reasoning-aware image editor. The model can issue commands like "rotate this view 45 degrees", "crop to the top left corner", "highlight all bolts" at any point during reasoning. The editor executes the edit, the updated image is passed back, and reasoning continues.
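The loop described above can be sketched in a few lines. This is a minimal illustration, not ETCHR's actual interface: the `Step` type, the `EDITS` table, the nested-list toy image, and `toy_model` are all assumptions made up for this example, and the two edits shown (crop and a 90 degree rotation) only stand in for whatever the trained editor really supports.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional, Tuple

Image = List[List[int]]  # toy grayscale image: rows of pixel values


@dataclass
class Step:
    """One model turn: either an edit command or a final answer (hypothetical type)."""
    command: Optional[str] = None
    args: Tuple[int, ...] = ()
    answer: Optional[str] = None


def crop(img: Image, top: int, left: int, height: int, width: int) -> Image:
    """Keep only the requested window of the image."""
    return [row[left:left + width] for row in img[top:top + height]]


def rotate90(img: Image) -> Image:
    """Rotate the image clockwise by 90 degrees."""
    return [list(row) for row in zip(*img[::-1])]


# Hypothetical command table standing in for the reasoning-aware editor.
EDITS: Dict[str, Callable[..., Image]] = {"crop": crop, "rotate90": rotate90}


def reason_with_edits(
    model: Callable[[Image, List[Step]], Step],
    image: Image,
    max_steps: int = 8,
) -> Optional[str]:
    """Alternate between model turns and image edits until the model answers."""
    history: List[Step] = []
    for _ in range(max_steps):
        step = model(image, history)
        history.append(step)
        if step.answer is not None:
            return step.answer
        # Execute the requested edit and hand the updated image back to the model.
        image = EDITS[step.command](image, *step.args)
    return None  # step budget exhausted without an answer


def toy_model(image: Image, history: List[Step]) -> Step:
    """Stand-in for an MLLM: crop to the top-left corner once, then count bright pixels."""
    if not history:
        return Step(command="crop", args=(0, 0, 2, 2))
    return Step(answer=str(sum(v > 0 for row in image for v in row)))


picture = [[1, 0, 1], [0, 1, 1], [1, 1, 1]]
print(reason_with_edits(toy_model, picture))
```

The point of the structure is that the base model never changes: only the image it is shown changes between turns, which is what makes a decoupled, plug-in editor possible.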

This is not another tool use wrapper. The editor was trained explicitly for reasoning, not general image editing. It will not draw cats. It will reliably perform exactly the visual transformations required to answer a question.

Most importantly, ETCHR is fully decoupled. It plugs into any existing MLLM training-free. It lifted Pass@1 by 4.8% on Qwen3-VL-8B, 5.5% on Gemini 3.1 Flash Lite, and 4.6% on Kimi K2.5 across five very different reasoning tasks. No fine-tuning required. No changes to the base model.

This is the single largest general-purpose improvement to multimodal reasoning published in the last 12 months.

You don't need higher resolution, you need better supervision: PGT

Every vendor roadmap this year promises 8k, 16k, 32k input resolution. Everyone has been telling you that bad fine-grained perception happens because the model cannot see well enough.

PGT proves this is wrong.

The authors show that almost all spatial reasoning failures come not from resolution limits, but from missing supervision. Current multimodal training sets never ask the model to count things, measure distances, or report relative positions. They only ask the model to name things. So that is the only thing models learn to do well.

PGT is a dead-simple data generation framework. It takes any existing training image, overlays geometric primitives, and generates thousands of trivial, unambiguous questions: "How many circles are above the square?", "Is the triangle left of the line?"
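A minimal sketch of the question-generation half of this idea, under stated assumptions: the `Primitive` type, the function names, and the question templates are invented for illustration and are not taken from the paper. Rendering the shapes onto the image is left out; the sketch only shows why the answers come for free once you control where the primitives are placed.

```python
import random
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class Primitive:
    kind: str  # "circle", "square", or "triangle"
    x: int     # centre x in pixels
    y: int     # centre y in pixels (0 is the top row, so smaller y means higher up)


def sample_primitives(width: int, height: int, n_circles: int,
                      rng: random.Random) -> List[Primitive]:
    """Pick random centres for one square, one triangle, and n_circles circles."""
    def place(kind: str) -> Primitive:
        return Primitive(kind, rng.randrange(width), rng.randrange(height))

    return [place("square"), place("triangle")] + [place("circle") for _ in range(n_circles)]


def make_questions(prims: List[Primitive]) -> List[Tuple[str, str]]:
    """Derive question/answer pairs directly from the known coordinates."""
    square = next(p for p in prims if p.kind == "square")
    triangle = next(p for p in prims if p.kind == "triangle")
    # Ties (equal coordinates) count as "not above" / "not left" in this sketch.
    circles_above = sum(1 for p in prims if p.kind == "circle" and p.y < square.y)
    return [
        ("How many circles are above the square?", str(circles_above)),
        ("Is the triangle left of the square?", "yes" if triangle.x < square.x else "no"),
    ]


rng = random.Random(0)  # fixed seed so the example is reproducible
layout = sample_primitives(width=640, height=480, n_circles=3, rng=rng)
for question, answer in make_questions(layout):
    print(question, "->", answer)
```

Because the labels are computed from the layout rather than annotated by hand, every generated example is correct by construction, which is what keeps this kind of supervision cheap.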

Fine-tuning LLaVA-v1.5 on just 100k of these synthetic examples improved performance on What'sUp by 20%. That is a larger gain than anyone has ever got from doubling input resolution, doubling parameter count, or any architectural change published to date.

Even better: this gain does not come at the cost of general capability. General benchmark scores stayed identical. The model just learned to actually see what is in the image.

Knowledge editing doesn't work for images yet: ASAM

You can edit a fact in a text LLM. You can reliably change the answer to "what is the capital of France" without breaking anything else.

No one has been able to do this for multimodal models. If you edit a fact to teach a model that a new logo is the logo for Coca-Cola, it will only recognise that exact image of the logo. It will not generalise to different angles, crops, colours or backgrounds. It will still recognise the old logo.

This paper explains why. Existing edit methods target a single sample in the latent space. They do not edit the entire semantic subspace that corresponds to the concept.

The authors introduce two techniques: Latent Adversarial Robustification, which generates all semantically equivalent variants of a concept during the edit, and Rank Constrained Subspace Learning, which aligns the entire subspace to the new value.

Initial results show edits now generalise correctly across 92% of visual variations, up from 41% with prior methods. Performance on unrelated tasks dropped less than 0.3%. This is the first work that makes multimodal knowledge editing even remotely usable for production.

We have been evaluating charts wrong: ChartFI

Every chart benchmark right now is useless.

Existing tests only check if a model can repeat the exact numbers shown on a chart. They never check if the model understands what the chart means. They do not penalise hallucinated trends, missed outliers, or completely wrong conclusions that happen to include the right numbers.

ChartFI fixes this. It is a new benchmark of 896 real world charts, scored across four dimensions: factual accuracy, correct emphasis of important features, appropriate context, and actual insight.

When run against 12 leading models, every single model scored under 50% on the insight metric. Even the best closed models reliably missed obvious outliers, reversed trends, and invented correlations that did not exist.

You have probably seen dozens of tweets showing GPT-7 or Kimi correctly reading numbers off a bar chart. Almost all of those results are meaningless. None of those models can actually explain what the chart is telling you.

Stop slicing images into grids: CVSearch

Right now every multimodal model handles high-resolution images exactly the same way. It cuts the image into a fixed grid of 256x256 tiles, encodes every tile, and passes all of them to the LLM.

This is unbelievably wasteful. It also breaks objects that land on tile boundaries.
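Some quick arithmetic makes both problems concrete. The sketch below assumes "4k" means 3840x2160 and "8k" means 7680x4320, and that partial tiles at the edges are still encoded; the function names are made up for this example.

```python
from typing import Tuple

TILE = 256  # tile edge in pixels, matching the fixed grid described above

Box = Tuple[int, int, int, int]  # (left, top, right, bottom), right/bottom exclusive


def tile_count(width: int, height: int, tile: int = TILE) -> int:
    """Tiles needed to cover the image, rounding partial tiles up."""
    cols = -(-width // tile)   # integer ceiling division
    rows = -(-height // tile)
    return cols * rows


def tiles_touched(box: Box, tile: int = TILE) -> int:
    """How many grid tiles an object's bounding box is spread across."""
    left, top, right, bottom = box
    cols = (right - 1) // tile - left // tile + 1
    rows = (bottom - 1) // tile - top // tile + 1
    return cols * rows


print(tile_count(3840, 2160))             # 15 columns x 9 rows = 135 tiles
print(tile_count(7680, 4320))             # 30 columns x 17 rows = 510 tiles
print(tiles_touched((10, 10, 100, 100)))  # 1: the object sits inside a single tile
print(tiles_touched((200, 100, 300, 180)))  # 2: the object straddles a vertical tile edge
```

Every one of those tiles is encoded whether or not it contains anything relevant, and a 100 pixel wide object that happens to sit across x = 256 is split between two tiles.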

CVSearch replaces this brute force approach with an adaptive search loop. First the model looks at the full low resolution image. If it can answer the question, it stops. If it needs more detail, it only zooms in on the semantically relevant regions. It never cuts an object in half. It never wastes compute encoding blank sky or empty white space.
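The coarse-to-fine loop above can be sketched as follows. This is a hedged illustration, not CVSearch's real API: `adaptive_search`, `try_answer`, and `propose` are hypothetical names, and the toy callbacks at the bottom fake the model's behaviour with a hard-coded target region.

```python
from collections import deque
from typing import Callable, Deque, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # (left, top, right, bottom) in full-resolution pixels


def adaptive_search(
    full: Box,
    try_answer: Callable[[Box], Optional[str]],
    propose: Callable[[Box], List[Box]],
    max_depth: int = 3,
) -> Tuple[Optional[str], int]:
    """Look at the whole image first and zoom only where the model asks to.

    Returns (answer, number_of_views_encoded).
    """
    frontier: Deque[Tuple[Box, int]] = deque([(full, 0)])
    views = 0
    while frontier:
        box, depth = frontier.popleft()
        views += 1
        answer = try_answer(box)
        if answer is not None:
            return answer, views  # stop as soon as the question can be answered
        if depth < max_depth:
            # Zoom only into regions flagged as relevant, not into every grid cell.
            frontier.extend((region, depth + 1) for region in propose(box))
    return None, views


FULL: Box = (0, 0, 3840, 2160)
TARGET: Box = (3000, 500, 3200, 700)  # made-up location of the object being asked about


def toy_try_answer(box: Box) -> Optional[str]:
    """Pretend the detail is legible only in views at most 1024 px wide."""
    l, t, r, b = box
    contains = l <= TARGET[0] and t <= TARGET[1] and r >= TARGET[2] and b >= TARGET[3]
    return "found" if contains and (r - l) <= 1024 else None


def toy_propose(box: Box) -> List[Box]:
    """Pretend the model flags one region: the whole target plus a 256 px margin."""
    l, t, r, b = TARGET
    return [(max(box[0], l - 256), max(box[1], t - 256),
             min(box[2], r + 256), min(box[3], b + 256))]


print(adaptive_search(FULL, toy_try_answer, toy_propose))
```

In this toy run the loop encodes two views in total: the full frame, then one zoomed region that contains the whole object. Because regions are proposed around content rather than cut on a fixed grid, the object is never split across views.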

On 4k and 8k benchmarks this approach matches the accuracy of full grid scanning while using 62% less compute. It also eliminated 78% of the boundary fragmentation errors that are responsible for almost all weird high-resolution failures.

CVSearch is training-free, open source, and drops directly into any existing MLLM inference stack. Code is available now.

What comes next

None of these works are flashy. None of them announce a new 2T parameter model. None of them have fancy demo videos.

That is exactly why they matter. We have passed the point where scaling parameters delivers meaningful gains for real world use cases. All the remaining gains will come from fixing the boring, broken, unglamorous parts of the stack.

Every one of these papers will be running inside production inference endpoints before the end of the year. If you are operating multimodal models today you should be testing all of them this week.