GRPO's compute problem: why most of your gradient budget is wasted

#grpo #rl-alignment #variance-reduction #diffusion-alignment #reward-sparsity

GRPO has become the default post-training alignment method for generative models. It replaced PPO's learned critic with group-relative advantages, simplified the training pipeline, and worked well enough across language models, vision-language models, and diffusion policies. But as these methods scale to 14B-parameter video diffusion models and multi-step vision-language-action (VLA) policies, a problem keeps surfacing: GRPO is computationally profligate. It spends gradient compute on parts of trajectories that contribute almost nothing to learning, amplifies variance from irrelevant sources, and loses its learning signal when all rollouts in a group fail.

Six papers from the latest batch of preprints attack this problem from different angles. They share a core observation: GRPO's uniform treatment of trajectory phases, denoising timesteps, and group samples wastes the majority of its compute budget. The fixes differ by domain, but the principle is the same. Allocate compute where the learning signal lives, and skip everything else.

The 78% problem: gradient compute dominates, and most of it is useless

The standard assumption in GRPO-based VLA training is that rollout collection is the bottleneck. Faster simulators and world models were supposed to fix this. Probabilistic Chunk Masking (PCM) measures where the time actually goes, and the numbers contradict the assumption.

In PCM's VLA experiments, gradient computation accounts for approximately 78% of wall-clock time per training step. Rollout collection is only 21%. Speeding up rollouts addresses the wrong cost center.
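A back-of-the-envelope Amdahl's-law calculation makes the point concrete. The sketch below plugs in the time split quoted above; the helper name `overall_speedup`, the hypothetical 10x and 5x component speedups, and the treatment of the remaining ~1% as fixed overhead are assumptions for illustration, not figures from the PCM paper.

```python
def overall_speedup(fractions, speedups):
    """Amdahl-style estimate: each component's time share is divided by its own speedup."""
    assert abs(sum(fractions.values()) - 1.0) < 1e-9, "time shares must sum to 1"
    new_time = sum(share / speedups.get(name, 1.0) for name, share in fractions.items())
    return 1.0 / new_time

# Time split per training step as quoted above; the leftover 1% is assumed fixed overhead.
split = {"gradient": 0.78, "rollout": 0.21, "other": 0.01}

rollouts_10x = overall_speedup(split, {"rollout": 10.0})   # hypothetical faster simulator
gradient_5x = overall_speedup(split, {"gradient": 5.0})    # hypothetical cheaper backward pass

print(f"10x faster rollouts: {rollouts_10x:.2f}x overall")
print(f"5x faster gradients: {gradient_5x:.2f}x overall")
```

Under these assumptions, even a 10x faster simulator makes the training step only about 1.23x faster, while a 5x cheaper backward pass gives about 2.66x.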

The deeper issue is what that gradient compute is doing. GRPO assigns the same advantage to every chunk in a rollout. In a robot-arm manipulation task where success and failure diverge only during the grasping phase, the approach and retraction phases still receive full gradient updates despite carrying near-zero learning signal. PCM formalizes this with per-phase gradient variance: only chunks where successful and failed rollouts produce different actions contribute meaningful gradient signal.

PCM's solution is a drop-in modification. It computes success-failure action variance as a proxy for per-phase gradient variance, then samples a fixed budget of chunks per trajectory using online-updated keep probabilities. The proxy requires no reward model or learned critic because it derives directly from the rollout data GRPO already collects.
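The idea can be sketched in a few lines of NumPy. This is a minimal illustration, not PCM's actual implementation: the proxy used here (squared gap between the mean successful and mean failed action per chunk), the uniform `floor`, the `momentum` smoothing, and the function names are all assumptions chosen to make the mechanism visible.

```python
import numpy as np

def chunk_keep_probs(actions, success, prev_probs=None, momentum=0.9, floor=0.05):
    """Per-chunk keep probabilities from a success/failure action-divergence proxy.

    actions: (num_rollouts, num_chunks, action_dim) array of executed actions.
    success: (num_rollouts,) boolean array of rollout outcomes.
    """
    num_chunks = actions.shape[1]
    uniform = np.full(num_chunks, 1.0 / num_chunks)
    if success.all() or not success.any():
        # No success/failure contrast in this batch: keep the previous estimate.
        return uniform if prev_probs is None else prev_probs

    # Assumed proxy: squared gap between mean successful and mean failed action.
    gap = actions[success].mean(axis=0) - actions[~success].mean(axis=0)
    score = (gap ** 2).sum(axis=-1)
    probs = score / score.sum() if score.sum() > 0 else uniform

    # Mix with uniform so no chunk is starved forever, then smooth online.
    probs = (1.0 - floor) * probs + floor * uniform
    if prev_probs is not None:
        probs = momentum * prev_probs + (1.0 - momentum) * probs
    return probs / probs.sum()

def sample_chunks(probs, budget, rng):
    """Sample a fixed budget of distinct chunk indices to backpropagate through."""
    return np.sort(rng.choice(len(probs), size=budget, replace=False, p=probs))

# Toy batch: 8 rollouts, 10 chunks, 4-dimensional actions.
rng = np.random.default_rng(0)
actions = rng.normal(0.0, 0.01, size=(8, 10, 4))
success = np.array([True, True, True, False, False, True, False, False])
actions[~success, 4:6, :] += 1.0  # failures diverge only in chunks 4 and 5

probs = chunk_keep_probs(actions, success)
kept = sample_chunks(probs, budget=2, rng=rng)
print(np.round(probs, 3), kept)
```

In this toy batch nearly all of the keep probability lands on chunks 4 and 5, so a budget of two chunks per trajectory would usually cover the only phase where successful and failed rollouts differ.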

The results are striking. On three LIBERO benchmarks, PCM matches standard GRPO's final success rate while backpropagating through fewer than 20% of trajectory chunks. That yields a 2.38x wall-clock speedup, 4.8x faster gradient updates, and 60% lower peak activation memory. The implication is blunt: on these benchmarks, standard GRPO spends over 80% of its backward passes on phases the policy already handles correctly after pre-training and supervised fine-tuning.

Timestep-confounded variance in diffusion alignment

Video diffusion models present a different but related problem. GRPO aligns these models by generating groups of videos, scoring them with reward models, and updating the policy. But the denoising process spans hundreds of timesteps, and the relationship between timestep and reward is far from uniform.

Flash-GRPO identifies two specific failure modes when applying standard GRPO to video diffusion. The first is timestep-confounded variance: within a group of rollouts sharing the same prompt, different timesteps have vastly different difficulty levels. A policy's performance at early denoising steps (coarse structure) and at late steps (fine detail) is confounded when advantages are computed across the full trajectory. The advantage signal mixes structural and textural quality, making the gradient noisy and inconsistent.

Second, time-dependent gradient scaling. The denoising loss has an inherent scaling factor that varies with timestep. Gradient magnitudes at early timesteps can be orders of magnitude larger than at late timesteps. Standard GRPO treats all timesteps equally in the loss, so these scaling differences create wildly inconsistent updates.

Flash-GRPO's fix is radical: abandon full trajectory training entirely. Instead of optimizing across all denoising steps, it trains on single timesteps. Iso-temporal grouping ensures that within each group, all samples are at the same timestep, eliminating the confounding. Temporal gradient rectification neutralizes the time-dependent scaling factor, stabilizing gradient magnitudes across timesteps.
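A rough sketch of how the two pieces fit together, under stated assumptions: `assumed_grad_scale` is a hypothetical stand-in for the time-dependent factor (the real one depends on the model's noise schedule and loss parameterization), timesteps are normalized so that values near 1 mean high noise, and all function names are illustrative rather than taken from Flash-GRPO.

```python
import numpy as np

def assumed_grad_scale(t):
    """Hypothetical stand-in for the time-dependent gradient scaling factor."""
    return t / (1.0 - t)

def iso_temporal_timesteps(num_prompts, group_size, rng):
    """Draw one timestep per prompt and share it across that prompt's whole group."""
    t_per_prompt = rng.uniform(0.05, 0.95, size=num_prompts)
    return np.repeat(t_per_prompt[:, None], group_size, axis=1)

def rectified_weights(rewards, timesteps, eps=1e-6):
    """Group-relative advantages divided by the assumed per-timestep scale."""
    mean = rewards.mean(axis=1, keepdims=True)
    std = rewards.std(axis=1, keepdims=True)
    advantages = (rewards - mean) / (std + eps)
    return advantages / assumed_grad_scale(timesteps)

rng = np.random.default_rng(0)
timesteps = iso_temporal_timesteps(num_prompts=3, group_size=4, rng=rng)
rewards = rng.uniform(0.0, 1.0, size=(3, 4))  # placeholder reward-model scores
weights = rectified_weights(rewards, timesteps)

print(np.round(timesteps, 3))
print(np.round(weights, 3))
```

Because every sample in a group shares one timestep, the rectification divides the whole group by the same constant, so the per-sample loss weights stay zero-mean within the group and comparisons are never made across timesteps of different difficulty.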

On models from 1.3B to 14B parameters, Flash-GRPO outperforms full trajectory training in alignment quality under low computational budgets. Previous efficiency methods that used sliding window subsampling of timesteps showed severe instability and failed to reach full trajectory performance. Flash-GRPO's single-step approach is both faster and more stable.

Not every denoising step needs RL

AdaScope approaches the same problem from a complementary angle. Rather than restructuring the grouping, it asks: at which denoising stages should we apply RL at all?

The observation is that RL fine-tuning has very different effects at different denoising stages. In early stages, image structures are unstable and far from the final output. Rewards computed at these stages are delayed and mismatched with the actions taken, producing high-variance, inefficient updates. In late stages, reward gains saturate. Continued training overfits to local details and intensifies reward hacking.

AdaScope is an RL-enhanced plug-in that adaptively identifies optimal intervention timing. It monitors structural evolution and semantic consistency during denoising, starts RL when the structure is stable enough for meaningful reward signal, and terminates training once reward gains saturate. This is conceptually similar to PCM's selective backpropagation, but applied at the timestep level rather than the chunk level.
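The start and stop logic can be illustrated with a simple thresholding sketch. Everything here is an assumption for illustration: the per-step `struct_change` and `reward_gain` signals, the two thresholds, and the `rl_window` helper are hypothetical and do not reproduce AdaScope's actual criteria.

```python
import numpy as np

def rl_window(struct_change, reward_gain, stable_thresh=0.2, gain_thresh=0.02):
    """Return (start, stop) denoising-step indices for RL, with stop exclusive.

    start: first step whose structural change falls below stable_thresh.
    stop:  first step at or after start whose reward gain falls below gain_thresh.
    """
    struct_change = np.asarray(struct_change, dtype=float)
    reward_gain = np.asarray(reward_gain, dtype=float)

    stable = np.flatnonzero(struct_change < stable_thresh)
    if stable.size == 0:
        return None  # structure never stabilizes: apply no RL at all
    start = int(stable[0])

    saturated = np.flatnonzero(reward_gain[start:] < gain_thresh)
    stop = start + int(saturated[0]) if saturated.size else len(reward_gain)
    return start, stop

# Toy signals over 8 denoising steps (made-up numbers).
struct_change = [0.90, 0.70, 0.40, 0.15, 0.08, 0.05, 0.04, 0.03]
reward_gain = [0.00, 0.01, 0.05, 0.20, 0.15, 0.06, 0.01, 0.005]

window = rl_window(struct_change, reward_gain)
print(window)  # (3, 6): apply RL at steps 3, 4 and 5 only
```

With these toy signals, RL updates would be applied to three of the eight steps: the early steps are skipped because the structure is still changing, and the last two are skipped because the reward gain has flattened out.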

The dual benefit is unusual in optimization. AdaScope reports a 66% performance improvement over state-of-the-art methods alongside a 59% cut in computational cost. You spend less compute and get better results because you stop actively harming the policy with low-signal updates.

Together, Flash-GRPO and AdaScope make a clear case: the denoising trajectory is not homogeneous. Treating it as such is both expensive and counterproductive.

Variance collapse: when GRPO eats its own signal

E²PO (Embedding-perturbed Exploration Preference Optimization) identifies a fundamental instability in GRPO that the previous papers don't address. GRPO's learning signal depends on intra-group variance. The advantage for each sample is computed relative to the group mean reward, so if all samples in a group earn nearly the same reward, the advantages shrink toward zero (or, under standard-deviation normalization, mostly reflect reward noise) and learning stalls.

The problem is that GRPO actively reduces this variance. As training progresses, the policy converges, samples within each group become more similar, and intra-group variance decays toward zero. This is not a bug in the implementation. It is a structural property of the algorithm. The better your policy gets, the less signal GRPO can extract from group comparisons.
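The dependence on intra-group variance is visible in the advantage computation itself. A minimal sketch, assuming the common mean-and-standard-deviation normalization with a small `eps` for numerical stability (the reward values are made up):

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantages: center on the group mean, scale by the group std."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Early in training: samples in a group differ, so the advantages carry signal.
diverse = group_relative_advantages([0.1, 0.9, 0.4, 0.6])

# After the policy converges: identical rewards, so every advantage is zero.
collapsed = group_relative_advantages([0.5, 0.5, 0.5, 0.5])

print(np.round(diverse, 3))
print(collapsed)
```

Once the group's rewards coincide, every advantage is zero and the policy-gradient term for that group vanishes, no matter how large the group is.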

Existing strategies try to address this by varying initial noise or increasing group sizes. Varying noise helps somewhat but doesn't prevent the convergence of the policy's internal representations. Larger groups increase compute cost linearly while providing diminishing returns as the policy narrows its output distribution.

E²PO's approach is to inject structured perturbations at the embedding level within sample groups. These perturbations guarantee a minimum variance floor that preserves the discriminative signal throughout training. The perturbations are not random noise. They are structured to maintain meaningful differences in the representation space while keeping samples within the distribution the policy can learn from.
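One way to picture a guaranteed variance floor is to give each group member its own fixed-size offset in embedding space. The sketch below is an assumption-heavy illustration, not E²PO's actual construction: the orthogonal directions, the `radius` parameter, and the `perturb_group` helper are all hypothetical.

```python
import numpy as np

def perturb_group(embedding, group_size, radius, rng):
    """Return group_size perturbed copies of one conditioning embedding.

    Offsets are centered (they sum to zero) and each has norm exactly `radius`,
    so the group's spread in embedding space cannot fall below a fixed floor.
    """
    dim = embedding.shape[0]
    assert 2 <= group_size <= dim, "need at least 2 samples and dim >= group_size"

    # Orthonormal directions from the QR decomposition of a random matrix.
    q, _ = np.linalg.qr(rng.normal(size=(dim, group_size)))
    directions = q.T                                   # (group_size, dim)
    directions = directions - directions.mean(axis=0, keepdims=True)
    directions = directions / np.linalg.norm(directions, axis=1, keepdims=True)
    return embedding[None, :] + radius * directions

rng = np.random.default_rng(0)
embedding = rng.normal(size=16)        # placeholder prompt embedding
group = perturb_group(embedding, group_size=4, radius=0.1, rng=rng)

offsets = group - embedding
print(np.round(np.linalg.norm(offsets, axis=1), 6))    # every offset has norm 0.1
```

Because the offsets are centered, the group mean stays on the original embedding, and because each has norm `radius`, the mean squared distance from that center is always `radius**2`, however far the policy has converged.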

Applied to flow models, E²PO significantly outperforms baselines in alignment with human preferences. The key insight is that exploration in GRPO-like methods needs to be maintained actively, not assumed. The algorithm's own optimization pressure will collapse the variance it depends on unless you intervene.

When all rollouts fail: reward sparsity in hard cases

The variance collapse problem has a more extreme variant: what happens when every rollout in a group fails? In object-level grounding tasks trained with GRPO, rewards are assigned at the response level. If a referring expression is particularly challenging, all candidate responses might score zero. The advantage is then zero for every sample. No learning occurs, and the hard cases that most need improvement get none.

Group Revision addresses this with a two-stage process. First, sample an initial response. Then, generate a set of revised candidates that attempt to improve on the initial attempt. The revision process explores the space around the failure rather than treating it as a dead end.

The critical contribution is the consolidation process, inspired by reward shaping. Each revised candidate's improvement over the initial attempt is quantified and converted into informative shaping signals. These signals refine the reward and modulate the advantage, amplifying the influence of high-quality revisions. Even when the absolute reward is zero for all candidates, the relative improvements between the initial attempt and revisions provide a learning signal.
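A toy version of the shaping step, under loud assumptions: it presumes a dense quality score is available for every response (IoU is used as the example), uses a simple additive bonus with a hypothetical weight `beta`, and shapes only the reward. The paper's actual consolidation process, including how it modulates the advantage, is not reproduced here.

```python
import numpy as np

def group_advantages(rewards, eps=1e-6):
    """Group-relative advantages: center on the mean, scale by the std."""
    rewards = np.asarray(rewards, dtype=float)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

def shaped_rewards(rewards, quality, initial_quality, beta=0.5):
    """Add a bonus proportional to each revision's improvement over the initial attempt."""
    improvement = np.asarray(quality, dtype=float) - initial_quality
    return np.asarray(rewards, dtype=float) + beta * improvement

# Hard case: the initial attempt and every revision miss the IoU >= 0.5 success bar.
initial_iou = 0.10
revision_iou = np.array([0.05, 0.30, 0.45, 0.12])
binary_reward = (revision_iou >= 0.5).astype(float)    # all zeros

plain = group_advantages(binary_reward)
shaped = group_advantages(shaped_rewards(binary_reward, revision_iou, initial_iou))

print(plain)                 # all zeros: no learning signal
print(np.round(shaped, 3))   # revisions are now ranked by their improvement
```

With the binary reward alone the group produces no gradient. With the shaping term, the revision that moved furthest toward the target gets the largest positive advantage, and the one that got worse is pushed down.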

This works because revision creates a structured comparison. Instead of comparing independent samples that all fail, you compare a failure to directed attempts at fixing that failure. The gradient points toward what changed between the attempt and the revision, which is a much more informative direction than the flat signal from comparing independent failures.

On referring expression comprehension, reasoning segmentation, and counting benchmarks, Group Revision achieves consistent gains over standard GRPO-based models. The approach is particularly effective on the hard cases that standard GRPO cannot learn from at all.

GRPO generalizes further than expected

The Reference-Free GRPO for Machine Translation paper makes a different kind of contribution. It shows that GRPO's applicability extends well beyond the decoder-only LLMs where it was developed.

The authors apply GRPO to NLLB-200, an encoder-decoder model at 600M and 1.3B parameter scales, across 13 typologically diverse languages. The reward is a hybrid of LaBSE and COMET-Kiwi scores, both reference-free metrics that require no parallel data at fine-tuning time. This matters because production MT systems for low-resource languages often lack the parallel corpora needed for supervised fine-tuning.

GRPO yields consistent improvements across all 13 languages, up to +5.03 chrF++ for Traditional Chinese. Without any target-language data, it competes with 3-epoch supervised fine-tuning on morphologically complex languages. The pattern is consistent: gains are largest where baseline performance is weakest and reward discriminability is highest.

This is the same variance signal principle from a different angle. GRPO works best where the reward function can clearly distinguish between good and bad outputs. When baseline performance is already high, the advantage variance shrinks and learning signal diminishes. When baseline performance is low but the reward is discriminative, the advantage variance is large and GRPO learns efficiently.

The emerging design principle

These papers converge on a single principle: GRPO's uniform compute allocation is the wrong abstraction. Learning signal in GRPO is driven by advantage variance, and advantage variance is not uniformly distributed across trajectories, timesteps, or training iterations.

PCM shows that per-phase gradient variance in VLA trajectories is concentrated in specific chunks. Flash-GRPO shows that timestep difficulty confounds advantage computation in video diffusion. AdaScope shows that denoising stages differ in how much they benefit from RL. E²PO shows that intra-group variance collapses as training progresses. Group Revision shows that hard cases with zero rewards need structured exploration rather than more sampling. The MT results show that discriminability of the reward function determines where GRPO is effective.

The practical takeaway for anyone running GRPO in production: profile where your compute goes, measure where your learning signal lives, and allocate accordingly. If gradient compute dominates your wall-clock time, as it did in PCM's measurements, masking low-signal chunks or timesteps is the first thing to try: PCM reports a 2.38x end-to-end speedup and 4.8x faster gradient updates at matched success rate. If your intra-group variance is collapsing, you need active exploration mechanisms, not larger groups. If your reward is sparse, revision-based approaches can extract signal from failures.

The next generation of GRPO variants will likely combine these insights. Selective backpropagation through high-variance phases, adaptive timestep scheduling, embedding-level perturbation for sustained exploration, and revision-based reward shaping are all compatible modifications. The question is not whether these changes improve GRPO. The papers demonstrate that they do. The question is how to compose them correctly and what the theoretical bounds are on the speedup-accuracy tradeoff.

Standard GRPO is a reasonable starting point. But if you're training 14B models for hundreds of GPU days and backpropagating through every chunk and every timestep, you're paying for compute that produces no learning. The data is clear on this.