MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation

Published 8 Apr 2026 in cs.CV | (2604.06966v1)

Abstract: Reinforcement learning (RL) has been successfully applied to autoregressive (AR) and diffusion models. However, extending RL to hybrid AR-diffusion frameworks remains challenging due to interleaved inference and noisy log-probability estimation. In this work, we study masked autoregressive models (MAR) and show that the diffusion head plays a critical role in training dynamics, often introducing noisy gradients that lead to instability and early performance saturation. To address this issue, we propose a stabilized RL framework for MAR. We introduce multi-trajectory expectation (MTE), which estimates the optimization direction by averaging over multiple diffusion trajectories, thereby reducing diffusion-induced gradient noise. To avoid over-smoothing, we further estimate token-wise uncertainty from multiple trajectories and apply multi-trajectory optimization only to the top-k% uncertain tokens. In addition, we introduce a consistency-aware token selection strategy that filters out AR tokens that are less aligned with the final generated content. Extensive experiments across multiple benchmarks demonstrate that our method consistently improves visual quality, training stability, and spatial structure understanding over baseline GRPO and pre-RL models. Code is available at: https://github.com/AMAP-ML/mar-grpo.


Authors (9)

Summary

The paper presents a stabilized GRPO method that uses multi-trajectory expectation to reduce gradient variance and improve RL optimization in hybrid image synthesis.
It incorporates token-level uncertainty and consistency-aware selection to focus updates on semantically coherent regions, thereby enhancing prompt fidelity and image quality.
Empirical results demonstrate superior performance on human preference metrics and compositional benchmarks, validating the method's efficacy and stability.

Stabilized Group Relative Policy Optimization for AR-Diffusion Hybrid Image Generation

Introduction and Motivation

Masked Autoregressive (MAR) models have become prominent for text-conditioned image synthesis because they combine the strengths of autoregressive (AR) and diffusion mechanisms. By pairing an AR transformer that predicts continuous latent features with a lightweight diffusion head that refines these latents, MAR frameworks mitigate the quantization bottleneck common to discrete tokenizers and encode richer distributions. Nonetheless, existing MAR models suffer from suboptimal prompt fidelity, visual quality issues, and training instability under reinforcement learning (RL)-based post-training. In particular, Group Relative Policy Optimization (GRPO), a sample-efficient RL variant, has struggled in this setting because AR and diffusion steps are interleaved, which induces noisy gradient updates and non-stationary rewards.
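For context, GRPO scores each sample relative to the other samples drawn for the same prompt instead of relying on a learned value function. The following is a minimal sketch of that group-relative advantage, not the paper's implementation. The function name, the `eps` constant, and the choice of population standard deviation are assumptions made for illustration; implementations differ on these details.

```python
from statistics import fmean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of samples for the same prompt.

    rewards: list of scalar rewards, one per generated image in the group.
    eps: small constant (an assumption) that avoids division by zero when
         every sample in the group receives the same reward.
    """
    mean = fmean(rewards)
    std = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mean) / (std + eps) for r in rewards]
```

Samples that score above the group mean receive a positive advantage and are reinforced, while those below the mean are penalized. Because the advantage depends on a single sampled reward per image, any randomness in how the image was decoded feeds directly into the update.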

"MAR-GRPO: Stabilized GRPO for AR-diffusion Hybrid Image Generation" (2604.06966) presents a principled resolution to these optimization pathologies, explicitly identifying the diffusion head as a dominant noise source and introducing new variance reduction techniques tailored for MAR-style RL optimization.

Technical Contributions

The paper formulates three core innovations for stabilizing RL in MAR-based AR-diffusion hybrid models:

Multi-Trajectory Expectation (MTE): By averaging optimization signals across multiple diffusion trajectories conditioned on the same AR latent, MTE yields lower-variance, more reliable policy gradients. The proposed mechanism directly reduces randomness injected by the inherently stochastic multi-step denoising process of the diffusion head, which otherwise yields inconsistent outputs and noisy credit assignment for policy improvement.
Uncertainty-Guided Selective Application: Uniform application of MTE can induce over-smoothing and hamper peak performance. A token-wise uncertainty map, computed as the standard deviation across multiple diffusion samples, allows the framework to target only the top-k% most uncertain spatial tokens. This selective smoothing ensures that variance reduction is concentrated on ambiguous, structurally complex regions while leaving stable areas to benefit from sharper gradients.
Consistency-Aware Token Selection: Not all tokens predicted by the AR transformer are equally consistent with the final output. By computing the change in similarity (e.g., cosine similarity) between intermediate latents at different AR steps and the final latent, the method applies a mask to exclude tokens whose optimization direction does not positively contribute to the ultimate image. This further improves convergence by focusing RL updates on semantically coherent content.
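The three components above can be sketched in a few lines of plain Python. This is an illustrative reading of the description, not the authors' code: all function names, the scalar per-token representation, the use of trajectory 0 as the single-trajectory fallback, and the `min_gain` threshold are assumptions.

```python
import math
from statistics import fmean, pstdev


def token_mean_and_std(trajectories):
    """trajectories: K lists, each holding one scalar per token (T tokens).

    Returns the per-token mean across trajectories (the multi-trajectory
    expectation) and the per-token standard deviation (the uncertainty score).
    """
    per_token = list(zip(*trajectories))  # T tuples of K values
    means = [fmean(vals) for vals in per_token]
    stds = [pstdev(vals) for vals in per_token]
    return means, stds


def top_k_percent_mask(scores, k_percent):
    """Mark the ceil(k% of T) highest-scoring tokens as True."""
    n_keep = math.ceil(len(scores) * k_percent / 100.0)
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    chosen = set(ranked[:n_keep])
    return [i in chosen for i in range(len(scores))]


def selective_mte(trajectories, k_percent):
    """Use the multi-trajectory mean on uncertain tokens, trajectory 0 elsewhere."""
    means, stds = token_mean_and_std(trajectories)
    mask = top_k_percent_mask(stds, k_percent)
    single = trajectories[0]  # assumed single-trajectory fallback
    mixed = [m if use else s for m, s, use in zip(means, single, mask)]
    return mixed, mask


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def consistency_mask(before, after, final, min_gain=0.0):
    """Keep a token when its AR step moved the latent closer to the final latent.

    before, after, final: per-token latent vectors at an earlier AR step,
    a later AR step, and the end of generation (an assumed interface).
    """
    return [cosine(a, f) - cosine(b, f) > min_gain
            for b, a, f in zip(before, after, final)]
```

In this sketch, `selective_mte` would supply the per-token quantity entering the policy-gradient loss, and `consistency_mask` would zero out the loss on tokens it rejects. How the paper combines the two masks is not specified in this summary.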

Analysis of Baseline Instabilities

The authors systematically dissect the causes of RL instability in MAR models:

Gradient Variance and Non-Stationarity: End-to-end optimization with GRPO leads to drastically increasing gradient norms and variance due to both the stochasticity of the diffusion head and parameter scale disparity between AR and diffusion modules. This provokes early performance degradation and reward hacking, notably in human preference metrics.
Role of the Diffusion Head: Freezing the diffusion head stabilizes training significantly, implying that the main learning signal should be directed at the AR transformer while the diffusion component serves as a fixed stochastic decoder. Even small updates to the diffusion head (for example, training it with a lowered learning rate) degrade stability, highlighting the system's pronounced sensitivity to changes in the decoder mapping.
Empirical Confirmation: By comparing models that optimize the AR transformer, the diffusion head, or both, the authors demonstrate that high stability and consistent performance gains are achieved only when learning is focused on the AR transformer, with the diffusion head either kept fixed or, for under-trained decoders, given a short tuning phase and only minimal updates afterwards.
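The recommendation to train only the AR transformer amounts to partitioning the model's parameters before building the optimizer. The following is a framework-agnostic sketch; the function name and the `diffusion_head.` prefix are hypothetical and would need to match the naming of the actual checkpoint.

```python
def split_trainable(param_names, frozen_prefixes=("diffusion_head.",)):
    """Partition parameter names into (trainable, frozen) by name prefix.

    param_names: iterable of dotted parameter names.
    frozen_prefixes: prefixes of modules to keep fixed (hypothetical default).
    """
    prefixes = tuple(frozen_prefixes)
    trainable, frozen = [], []
    for name in param_names:
        # str.startswith accepts a tuple and matches any of its entries.
        target = frozen if name.startswith(prefixes) else trainable
        target.append(name)
    return trainable, frozen
```

In a PyTorch setup, one would typically iterate over `model.named_parameters()`, call `requires_grad_(False)` on the frozen parameters, and pass only the trainable ones to the optimizer.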

Experimental Evaluation

The approach is validated on multiple benchmarks and base models, including NOVA and Harmon. Evaluation on human preference metrics (HPS, ImageReward, PickScore, Aesthetic Score) and spatial-compositional accuracy (T2I-CompBench) demonstrates:

Consistent and Superior Performance: MAR-GRPO, incorporating both MTE and consistency-aware token selection, achieves higher peak rewards and improved metric scores compared to vanilla GRPO, GRPO with only the decoder fixed, and pre-RL baselines.
Stable Training Dynamics: Training curves show smoother reward growth, smaller and more stable gradient norms, and less KL divergence drift. This directly addresses gradient collapse and mode dropping observed in baseline GRPO models.
Qualitative Improvements: Image generations exhibit more detailed structural consistency, finer textures, and higher fidelity in object relationships (notably in compositional and counting tasks). The method mitigates issues such as unnatural textures or incoherent spatial layouts found in previous RL-tuned models.

Ablations and Parametric Insights

Extensive ablations quantify the impact of each methodological refinement:

Increasing the number of diffusion trajectories improves stability up to a saturation point; beyond this, over-smoothing occurs.
The top-k% uncertainty threshold balances stability and performance, with optimal results at around 30% selection.
Similarity threshold tuning for token selection confirms that stricter filtering hinders optimization, while overly loose thresholds re-introduce instability.
The framework introduces negligible training overhead because the diffusion head is lightweight, allowing scalable deployment on current MAR backbones.
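One way to read the saturation with the number of trajectories is through the variance of an average. Under the simplifying assumption (not stated in the source) that the K per-trajectory estimates g_k of a gradient component are independent and identically distributed given the AR latent, with variance \sigma^2, the averaged estimate satisfies:

```latex
\operatorname{Var}\!\left[\frac{1}{K}\sum_{k=1}^{K} g_k\right] = \frac{\sigma^2}{K},
\qquad
\operatorname{std}\!\left[\frac{1}{K}\sum_{k=1}^{K} g_k\right] = \frac{\sigma}{\sqrt{K}}
```

Going from K = 1 to K = 4 halves the noise, but halving it again requires K = 16, so the stability benefit flattens quickly as K grows. This accounts only for the diminishing return; it does not explain the over-smoothing reported beyond the saturation point.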

Limitations and Future Work

The paper acknowledges several open directions:

Base Model Sensitivity: Not all MAR architectures benefit equally, likely due to differences in pretraining data and diffusion head designs. Harmon, for example, sees smaller relative gains than NOVA.
Video Generation and Larger Diffusion Heads: The approach is not explored for AR-diffusion video synthesis frameworks or models employing deeper diffusion modules (as in BLIP-3o [2]), both of which raise distinct optimization challenges.
Token/Step-Level Credit Assignment: Future research could further refine RL signal assignment by explicitly modeling token or AR/diffusion step informativeness within the hybrid generative process.

Conclusion

MAR-GRPO establishes a robust, variance-reduced RL optimization paradigm for AR-diffusion hybrid image generators by identifying and mitigating the destabilizing role of the diffusion head through multi-trajectory expectation and token-level selection. The method improves image quality, compositional accuracy, and training stability across established benchmarks at minimal computational cost. This stabilization framework provides a theoretically motivated and empirically validated foundation for RL-based post-training in MAR models, enabling broader and more reliable application to text-to-image and, potentially, video generation. The analysis and approach outlined in this work should catalyze further progress in scalable, compositional RL for hybrid generative architectures (2604.06966).
