The Quiet Breakthroughs In LLM Reasoning And Memory That No One Is Talking About

#llm-reasoning #working-memory #agent-memory #transformer-mechanics #llm-benchmarks

Six papers landed on arXiv last Tuesday that together change almost everything we thought we knew about LLM reasoning.

None of them introduce a new model. None scale parameters. None use proprietary training data. None benchmark against each other. None got posted to Twitter. All of them attack the unstated bad assumptions that this entire field has operated on for the last three years.

Stop generating thought tokens

For two years we have operated under one unchallenged rule: to make an LLM reason, you make it generate intermediate tokens. Chain of thought, scratchpads, Reflexion: every improvement worked by forcing the model to externalize each step of its thinking.

This was always a weird hack. Humans do not narrate every single intermediate thought out loud before answering a question. We hold state in working memory. We manipulate it internally. We only output the result.

Reasoning in Memory (RiM) demonstrates we were wasting up to 72% of our reasoning compute for no reason. Instead of generating autoregressive thought tokens, RiM inserts fixed sequences of unused special tokens into the forward pass. These tokens act as scratch space inside the model's activations. The model can write state into these blocks, modify it, and read it back entirely internally. No thought tokens are generated. No sampling happens. Everything runs in a single forward pass.
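The input side of this idea can be sketched in a few lines. This is a minimal illustration, not the RiM authors' implementation: the token id `SCRATCH_TOKEN_ID`, the helpers `insert_scratch_block` and `sampling_steps`, and the choice to place the block directly after the prompt are all assumptions made for the example.

```python
from typing import List

# Hypothetical id of a reserved, otherwise unused special token.
SCRATCH_TOKEN_ID = 50_000


def insert_scratch_block(prompt_ids: List[int], n_scratch: int,
                         scratch_id: int = SCRATCH_TOKEN_ID) -> List[int]:
    """Return the prompt followed by a fixed block of scratch tokens.

    Assumption: the block sits directly after the prompt. The positions
    occupied by scratch tokens give the model extra activations to use
    as working space; nothing is sampled at those positions.
    """
    if n_scratch < 0:
        raise ValueError("n_scratch must be non-negative")
    return list(prompt_ids) + [scratch_id] * n_scratch


def sampling_steps(n_thought_tokens: int, n_answer_tokens: int,
                   use_scratch: bool) -> int:
    """Count sequential sampling steps needed before the answer is complete."""
    thought_steps = 0 if use_scratch else n_thought_tokens
    return thought_steps + n_answer_tokens


padded = insert_scratch_block([101, 2054, 2003], n_scratch=4)
```

With 200 thought tokens and a 10-token answer, `sampling_steps(200, 10, use_scratch=False)` counts 210 sequential sampling steps, against 10 when a scratch block stands in for the thought tokens. The helper counts sampling steps only; it says nothing about the cost of each forward pass.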

Across every tested model family and size, RiM matches or outperforms standard chain of thought. It does this while using between 30% and 72% less compute per reasoning step.

We did not need to make models bigger to get better reasoning. We just needed to stop forcing them to talk to themselves out loud.

You are sampling reasoning wrong

Last month the field lost its mind over the discovery that you could match the reasoning performance of RL fine-tuned models by just re-sampling traces from raw base models, with no additional training at all. Everyone missed the critical flaw in that method.

The original algorithm cut reasoning traces at uniformly random positions, then re-sampled the remainder. This works about as well as you would expect. 95% of tokens in a reasoning trace are irrelevant filler. Cutting there just rewrites punctuation and local phrasing. It almost never revisits the actual decision that broke the answer.

Entropy-Cut Metropolis-Hastings fixes this. The algorithm uses next token entropy as a proxy for decision points. Wherever the model was uncertain what token to output next, that is where a choice was made. That is where you cut.
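A minimal sketch of the cut-selection step, assuming we already have the model's next-token distribution at every position of the trace. The helper names, and the choice to sample the cut position in proportion to entropy, are assumptions for illustration; the paper's exact proposal distribution and the Metropolis-Hastings accept/reject step are not shown.

```python
import math
import random
from typing import List, Sequence


def token_entropy(probs: Sequence[float]) -> float:
    """Shannon entropy (in nats) of one next-token distribution."""
    h = -sum(p * math.log(p) for p in probs if p > 0.0)
    return max(0.0, h)  # guard against -0.0 and tiny negative rounding


def choose_cut_uniform(n_positions: int, rng: random.Random) -> int:
    """Baseline: every position in the trace is equally likely."""
    return rng.randrange(n_positions)


def choose_cut_by_entropy(dists: List[Sequence[float]],
                          rng: random.Random) -> int:
    """Pick a cut position with probability proportional to entropy.

    Assumption: proportional weighting. Positions where the model was
    certain (entropy 0) are never chosen unless every position is certain.
    """
    weights = [token_entropy(d) for d in dists]
    if sum(weights) == 0.0:
        return choose_cut_uniform(len(dists), rng)
    return rng.choices(range(len(dists)), weights=weights, k=1)[0]


# Toy trace: only position 1 was a real decision.
trace_dists = [[1.0, 0.0], [0.5, 0.5], [1.0, 0.0]]
cut = choose_cut_by_entropy(trace_dists, random.Random(0))
```

In the toy trace the two certain positions carry zero weight, so the cut always lands on position 1. One design point to keep in mind: once cut positions are no longer uniform, the proposal is not symmetric, so a full Metropolis-Hastings sampler would need to include the proposal probabilities in its acceptance ratio.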

With this one change, mixing time no longer scales with the total number of tokens in the trace. It scales only with the number of actual decisions made. On MATH500, HumanEval and AIME26 this method beats every published RL fine-tuned model. It requires zero training, zero fine-tuning, zero curated datasets. It runs on any base model.

This is not an incremental improvement. This is an existence proof that most of the reasoning performance gains attributed to reinforcement learning over the last year were just working around bad sampling algorithms.

Planning was missing the first step

Every plan-based reasoning system uses the same pipeline: question, plan, execution. This works better than nothing. It also skips the step every human takes before making a plan.

Before you decide how to solve a problem, you first decide what problem you are solving. You classify the problem type. You note common pitfalls. You list which tools apply. No LLM reasoning system had an explicit stage for this until this week.
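As a rough sketch of what such a stage could look like in a prompt, the template below asks for the three items named above before any plan is written. The wording, the `PREPLAN_FIELDS` tuple and the `build_staged_prompt` helper are hypothetical and are not taken from the paper.

```python
# Hypothetical field names, mirroring the three items described in the text.
PREPLAN_FIELDS = ("Problem type", "Common pitfalls", "Applicable tools")


def build_staged_prompt(question: str) -> str:
    """Ask for a preplan, then a plan, then the step-by-step solution."""
    preplan_lines = "\n".join(f"- {name}:" for name in PREPLAN_FIELDS)
    return (
        f"Question: {question}\n\n"
        "Stage 1, Preplan. Before planning, fill in:\n"
        f"{preplan_lines}\n\n"
        "Stage 2, Plan. Outline the solution steps.\n\n"
        "Stage 3, Solution. Carry out the plan step by step."
    )


prompt = build_staged_prompt("How many primes are below 30?")
```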

The Preplan-Plan-CoT framework inserts this single extra stage. It adds no extra inference tokens. It does not increase runtime. Across four base models and five mathematical reasoning benchmarks it won 39 out of 40 measured metrics. It improves maj@16 by 2.23% and pass@16 by 3.06% over the strongest existing baseline.
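For readers unfamiliar with the two metrics, here is a small sketch of how maj@k and pass@k are commonly scored for a single problem. It assumes the usual definitions (majority vote over k sampled answers, and at least one correct answer among k); the text does not say exactly how the paper computes them, and the function names are illustrative.

```python
from collections import Counter
from typing import List


def pass_at_k(samples: List[str], reference: str) -> bool:
    """True if at least one of the k sampled answers is correct."""
    return any(answer == reference for answer in samples)


def maj_at_k(samples: List[str], reference: str) -> bool:
    """True if the most frequent sampled answer is correct.

    Ties are broken in favour of the answer that appeared first.
    """
    if not samples:
        return False
    top_answer, _ = Counter(samples).most_common(1)[0]
    return top_answer == reference


# 16 samples where the correct answer "42" is found but outvoted.
outvoted = ["41"] * 9 + ["42"] * 7
# 16 samples where "42" is the most frequent answer.
winning = ["42"] * 7 + ["41"] * 5 + ["40"] * 4
```

The `outvoted` case passes pass@16 but fails maj@16, which is why the two numbers are reported separately. Benchmark-level scores are the average of these per-problem results.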

This is the most embarrassing kind of breakthrough. It is the thing that is so obvious once someone points it out that you will spend the next week wondering why you did not think of it three years ago.

LLMs do not track state incrementally

This paper is the most important negative result published in this field in the last two years.

Everyone assumed that when an LLM reads text, it incrementally updates its internal model of the world as it goes. This is not what happens.

LLMs do not update state as they read. They store nothing. They hold nothing. They wait until the very last token of the input, and only then attempt to reconstruct all required state in one single parallel pass.

This explains almost every dumb, consistent failure you have ever seen an LLM make. This is why they forget you moved the ball. This is why they get ownership wrong. This is why they fail simple counting tasks. This is why adding an extra irrelevant sentence will flip their answer on questions they otherwise got right.

Worse, the REMOVE operation is implemented as a fragile global suppression tag. If the model marks something as removed, it will leak that suppression across unrelated identical entities anywhere else in the context. This failure mode was predicted directly from mechanistic analysis, then confirmed behaviourally.
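A toy illustration of the behavioural failure described here, written as ordinary state-tracking code. It is an analogy only and says nothing about how the mechanism is implemented inside a transformer; the event format and both function names are made up for the example.

```python
from typing import List, Set, Tuple

Event = Tuple[str, str, str]  # (operation, owner, item)


def track_with_global_suppression(events: List[Event]) -> Set[Tuple[str, str]]:
    """Leaky version: a REMOVE suppresses the item name for every owner."""
    present: Set[Tuple[str, str]] = set()
    suppressed_items: Set[str] = set()
    for op, owner, item in events:
        if op == "ADD":
            present.add((owner, item))
        elif op == "REMOVE":
            suppressed_items.add(item)  # the owner is ignored here
    return {(o, i) for (o, i) in present if i not in suppressed_items}


def track_with_scoped_removal(events: List[Event]) -> Set[Tuple[str, str]]:
    """Correct version: a REMOVE only affects the named owner's item."""
    present: Set[Tuple[str, str]] = set()
    for op, owner, item in events:
        if op == "ADD":
            present.add((owner, item))
        elif op == "REMOVE":
            present.discard((owner, item))
    return present


story = [
    ("ADD", "Alice", "apple"),
    ("ADD", "Bob", "apple"),
    ("REMOVE", "Alice", "apple"),
]
leaky = track_with_global_suppression(story)
scoped = track_with_scoped_removal(story)
```

In the leaky tracker, removing Alice's apple also erases Bob's, so `leaky` ends up empty while `scoped` still contains `("Bob", "apple")`.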

We have spent three years building agent systems on top of a capability that does not actually exist.

Belief management is the real agent bottleneck

Long horizon agents do not fail because they cannot plan. They do not fail because they cannot write code. They fail because they cannot manage their own beliefs.

They update on irrelevant noise. They fail to update on hard evidence. They forget what they concluded three turns ago. No one had ever properly measured this until now.

The new BeliefTrack benchmark shows vanilla LLMs fail belief management operations 68% of the time. Explicit belief tracking prompts reduce this failure rate by 12%. Training with GRPO on belief state rewards reduces it by 70.9%.
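The text does not say whether these reductions are relative or in percentage points, so a quick check of both readings is useful. The snippet below uses only the three figures quoted above; the function names are illustrative.

```python
BASELINE_FAILURE = 0.68  # vanilla failure rate quoted above


def after_relative_reduction(rate: float, reduction: float) -> float:
    """Failure rate after cutting it by a fraction of itself."""
    return rate * (1.0 - reduction)


def after_absolute_reduction(rate: float, points: float) -> float:
    """Failure rate after subtracting percentage points."""
    return rate - points


prompting_relative = after_relative_reduction(BASELINE_FAILURE, 0.12)   # about 0.598
prompting_absolute = after_absolute_reduction(BASELINE_FAILURE, 0.12)   # about 0.56
grpo_relative = after_relative_reduction(BASELINE_FAILURE, 0.709)       # about 0.198
grpo_absolute = after_absolute_reduction(BASELINE_FAILURE, 0.709)       # negative
```

The percentage-point reading of the GRPO figure gives a negative failure rate, so that number has to be a relative reduction, leaving a failure rate of roughly 20%. The 12% prompting figure is consistent with either reading (roughly 60% or 56% remaining).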

Building on this, Meta-Cognitive Memory Policy Optimization abandons the standard approach of training memory policies only on final task success. Instead, it penalizes memory summaries that leave the model with high epistemic uncertainty about the current task state. Agents trained this way retain 97.1% of their performance even when contexts are scaled out to 1.75 million tokens.
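One way to picture the reward shaping is as task success minus an uncertainty penalty. The sketch below is an assumption-heavy illustration, not the paper's objective: it measures uncertainty as the entropy of a probability distribution over candidate task states, and both `penalty_weight` and the function names are hypothetical.

```python
import math
from typing import Sequence


def belief_entropy(state_probs: Sequence[float]) -> float:
    """Entropy (in nats) of the agent's belief over candidate task states."""
    h = -sum(p * math.log(p) for p in state_probs if p > 0.0)
    return max(0.0, h)  # guard against -0.0


def shaped_reward(task_success: float, state_probs: Sequence[float],
                  penalty_weight: float = 0.5) -> float:
    """Task reward minus a penalty for an uncertain post-summary belief.

    Assumption: uncertainty is approximated by belief entropy, and the
    penalty is linear in it.
    """
    return task_success - penalty_weight * belief_entropy(state_probs)


# A summary that leaves the agent certain about the task state is not penalized.
confident = shaped_reward(1.0, [1.0, 0.0])
# A summary that leaves two states equally plausible loses 0.5 * ln(2).
uncertain = shaped_reward(1.0, [0.5, 0.5])
```

Under this shaping, two summaries that both lead to task success are no longer scored the same: the one that leaves the agent unsure of the current state earns less.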

No one was measuring this. Everyone was just scoring whether the agent completed the task at the end. No one checked if it actually understood what was happening along the way.

What this changes

None of these works are incremental. None of them are chasing the next parameter milestone. All of them are pulling the field out of a local maximum that we have been stuck in since chain of thought was published.

We were not hitting the limits of what existing models can do. We were just using them wrong.

Half of these techniques can be deployed next week on existing production systems with no model retraining. The other half require only lightweight fine-tuning that almost every team can run.

There will not be one big announcement. There will not be a new flagship model. But 12 months from now, every LLM reasoning system, every agent framework, every production endpoint will be running variants of the ideas published last Tuesday.