If you run diffusion models in production, you already know the dirty secret of this field. Base model quality stopped improving meaningfully 12 months ago. All remaining pain is operational. Inference is too slow. Training gradients are wasteful. Distillation hits hard unexplained ceilings. No one can reliably tell when a model is memorizing instead of generalizing.
Over 72 hours last week, four arXiv papers and one Nvidia code drop landed that address every single one of these problems. None have pretty demo gifs. None claim to be state of the art on FID. All of them will change how you run diffusion in production before the end of the quarter.
You are throwing away 70% of compute on gradient noise
Almost every non-trivial use of diffusion today runs against a frozen pretrained teacher model. This includes text-to-3D distillation, single step student distillation, training data attribution, and guided generation pipelines.
Every one of these pipelines computes gradients as a Monte Carlo expectation over noise samples and timesteps. Until now, every implementation did this the stupid way. For every single noise draw, it reran the entire expensive upstream stack: image encoding, text conditioning, renderer forward passes, everything. No one stopped to ask which parts of the computation actually change per sample.
CARV fixes this. The framework splits the Monte Carlo estimator into two tiers. Expensive upstream computation is run once per gradient step. Only the cheap diffusion noise sampling is rerun for the multiple draws required to reduce variance. Add timestep importance sampling and stratified inverse CDF sampling, and you get an effective 2-3x compute multiplier for exactly the same objective. No retraining. No change to convergence. No tradeoffs.
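To make the two-tier split concrete, here is a minimal Python sketch. It is not the CARV authors' code: `expensive_upstream`, `per_sample_term`, the `calls` counter, and the proposal density used for timestep importance sampling are all hypothetical stand-ins, chosen so the example runs with the standard library alone.

```python
import math
import random

calls = {"upstream": 0}  # counts how often the expensive stage runs

def expensive_upstream(theta):
    """Hypothetical stand-in for the costly stack (encoders, renderer passes)."""
    calls["upstream"] += 1
    return 2.0 * theta

def per_sample_term(features, t, eps):
    """Hypothetical stand-in for the cheap per-draw diffusion term."""
    return features * (1.0 - t) + eps * t

def stratified_importance_timesteps(k, rng):
    """Stratified uniforms pushed through an inverse CDF.

    The proposal density p(t) = (1 + t) / 1.5 on [0, 1] is an arbitrary
    illustrative choice; its inverse CDF is t = sqrt(1 + 3u) - 1.
    Returns (t, weight) pairs with weight = 1 / p(t).
    """
    pairs = []
    for i in range(k):
        u = (i + rng.random()) / k          # one draw per stratum
        t = math.sqrt(1.0 + 3.0 * u) - 1.0  # inverse CDF of the proposal
        pairs.append((t, 1.5 / (1.0 + t)))  # importance weight 1 / p(t)
    return pairs

def naive_estimate(theta, k, rng):
    """Baseline: reruns the upstream stage for every noise draw."""
    total = 0.0
    for _ in range(k):
        features = expensive_upstream(theta)
        total += per_sample_term(features, rng.random(), rng.gauss(0.0, 1.0))
    return total / k

def two_tier_estimate(theta, k, rng):
    """Upstream once per step, then k cheap draws that reuse its output."""
    features = expensive_upstream(theta)
    total = 0.0
    for t, w in stratified_importance_timesteps(k, rng):
        total += w * per_sample_term(features, t, rng.gauss(0.0, 1.0))
    return total / k
```

Calling `two_tier_estimate(1.0, 8, random.Random(0))` runs the upstream stage once, where `naive_estimate` with the same arguments runs it eight times. Both estimators target the same expectation over uniform timesteps and Gaussian noise, so the objective is unchanged in this toy setting.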
The most important result in this paper is the negative one. When the authors applied CARV to single step distillation, they reduced gradient variance by an order of magnitude. FID did not improve at all.
This is a bombshell. For two years the entire field assumed distillation quality was limited by gradient noise. We were wrong. Variance was never the bottleneck. We were just hitting a different limit entirely, and no one noticed because everyone was too busy fixing the noise.
One step discrete diffusion is now production ready
Discrete token diffusion has always had better sample efficiency and better composition than continuous pixel diffusion. No one uses it at scale, because it has needed 32+ inference steps and every prior distillation attempt fell apart completely.
Existing distillation methods trained on soft logit outputs from the teacher. They never worked because the student never learned to operate on the actual discrete codebook manifold the teacher uses. It learned to produce good outputs for interpolated soft inputs that never exist at inference time.
Fixed Point Distillation (FPD) fixes this with one simple, obvious change that no one tried for three years. On the forward training pass, the authors run hard-sampled discrete tokens through the teacher. On the backward pass, a straight-through estimator routes continuous gradients back to the student logits. That is it.
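The hard-forward, soft-backward trick can be sketched in a few lines of plain Python. This is a generic straight-through estimator written out by hand, not the paper's training loop: `ste_step`, the three-entry codebook, and the `teacher_grad` callable are hypothetical, and a real setup would rely on an autograd framework instead of a manual backward pass.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    s = sum(exps)
    return [e / s for e in exps]

def sample_one_hot(probs, rng):
    """Hard sample: draw one codebook index and return it as a one-hot vector."""
    idx = rng.choices(range(len(probs)), weights=probs, k=1)[0]
    return [1.0 if i == idx else 0.0 for i in range(len(probs))]

def ste_step(student_logits, teacher_grad, rng):
    """One forward/backward pass with a straight-through estimator.

    teacher_grad is a hypothetical callable: given the one-hot token the
    teacher consumed, it returns d(loss)/d(token) as a list of floats.
    """
    probs = softmax(student_logits)
    hard = sample_one_hot(probs, rng)   # forward: the teacher sees a real token
    grad_hard = teacher_grad(hard)      # gradient at the teacher's input
    # Straight-through: pretend d(hard)/d(probs) is the identity matrix.
    grad_probs = grad_hard
    # Backprop through softmax: dL/dz_i = p_i * (g_i - sum_j p_j * g_j).
    dot = sum(p * g for p, g in zip(probs, grad_probs))
    grad_logits = [p * (g - dot) for p, g in zip(probs, grad_probs)]
    return hard, grad_logits
```

In an autograd framework the same effect is commonly written as `hard + probs - stop_gradient(probs)`, which evaluates to `hard` on the forward pass while gradients flow through `probs`.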
FPD produces single step discrete diffusion models that close 92% of the FID gap to the original 32 step teacher. It outperforms every existing distillation baseline by a wide margin, uses no auxiliary networks, and adds no overhead to the final student model.
This work removes the last remaining reason to run continuous pixel diffusion for most use cases. Discrete models will be the default for edge deployment within 6 months.
Generalization has two separate thresholds
For as long as diffusion models have existed, we have argued about memorization. A model can get near perfect benchmark scores, and still be clearly regurgitating chunks of its training set. No one had a good explanation for this.
We now have the explanation. There are two completely separate, independent transitions that happen during training.
First, the model converges to match the bulk statistical distribution of the training data. This happens early, once the training set size grows linearly with the input dimension. This is what FID, Inception Score (IS), and every other standard metric measure. Once you pass this threshold, all your numbers look great.
Second, the model recovers the actual underlying latent factors of the data. This is a sharp, discontinuous transition that happens much later, with far more training data. This is when the model stops memorizing and starts actually generalizing.
You can have a model that has fully passed the first threshold, and not even begun the second. That describes almost every production diffusion model released to date. This is why you get perfect FID scores and models that still draw six fingers. This is why you can run every memorization detection test and come back clean, and still pull exact training copies out of the model.
We have been measuring the wrong thing. All this time we thought we were testing generalization. We were only testing convergence.
A single slider for distortion vs perception
For two years we accepted an unavoidable tradeoff. Regression-based restoration methods stay close to the ground-truth pixels in one step, and look terrible. Diffusion methods give beautiful, realistic outputs in 50 steps, and will never match ground-truth pixels. Everyone agreed this tradeoff was fundamental.
It was not.
DiSI disentangles the stochastic interpolant used by diffusion models into two completely independent terms: one for regression fidelity, one for generative texture. You train the model once. At inference time you adjust a single scalar parameter.
Slide it to 0, you get pure regression output. One step. Exact pixel alignment. Maximum PSNR.
Slide it to 1, you get full diffusion output. 4 steps. Photorealistic texture.
You can stop at any point in between. No retraining. No separate models. No fine tuning.
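As a rough picture of what a single inference-time slider looks like from the caller's side, here is a toy Python sketch. It does not reproduce DiSI's interpolant or its trained networks: `regression_estimate`, `texture_increment`, and `restore` are hypothetical placeholders (a moving average and Gaussian noise), and only the control flow is meant to mirror the description above.

```python
import random

def regression_estimate(degraded):
    """Hypothetical one-step regressor: a 3-tap moving average."""
    n = len(degraded)
    out = []
    for i in range(n):
        window = degraded[max(0, i - 1):min(n, i + 2)]
        out.append(sum(window) / len(window))
    return out

def texture_increment(current, rng):
    """Hypothetical generative term: small random detail per refinement step."""
    return [rng.gauss(0.0, 0.05) for _ in current]

def restore(degraded, strength, rng, steps=4):
    """Blend a deterministic estimate with a generated residual.

    strength=0.0 returns the regression output after a single pass;
    strength=1.0 applies the full texture term over `steps` refinements;
    values in between scale the texture term linearly.
    """
    if not 0.0 <= strength <= 1.0:
        raise ValueError("strength must be in [0, 1]")
    x = regression_estimate(degraded)
    if strength == 0.0:
        return x  # pure regression path, no refinement steps
    for _ in range(steps):
        inc = texture_increment(x, rng)
        x = [xi + strength * di for xi, di in zip(x, inc)]
    return x
```

For example, `restore(signal, 0.0, random.Random(1))` returns exactly the regression estimate, while `restore(signal, 0.5, random.Random(1))` adds half of the texture that `strength=1.0` would add with the same seed. The point of the sketch is the interface: one trained model, one scalar chosen per request.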
This is not an incremental improvement. This is the solution to the single most complained about problem with diffusion based restoration. Every image editing pipeline will ship this control before the end of the year.
All progress is now operational
Nvidia dropped Sana, their new linear diffusion transformer, the same day the last of these papers went up. That is not a coincidence.
None of these works cite each other. All were written independently. All arrived within three days. All attack the same unspoken shift in the field.
We are done building better diffusion models. We have models that are good enough for almost every use case anyone actually wants to build. All remaining progress is operational. It is about making the models we already have cheaper, faster, more predictable, and cheaper again.
If you are on an ML team running diffusion right now, you should stop whatever base model experiment you are running this week. Go implement CARV first. It will cut your training costs in half this afternoon. Then port FPD. Then test DiSI.
None of these require new data. None require training a new foundation model. All of them will move the metrics that actually matter: latency per request, cost per image, uptime, user satisfaction.
This is the boring, unglamorous work that turns a research demo into an industrial technology. This is where all the real progress is happening right now.