Consolidating Reinforcement Learning for Multimodal Discrete Diffusion Models

Tianren Ma, Mu Zhang, Yibing Wang, Qixiang Ye

2025-10-06

Summary

This paper tackles the problem of effectively training discrete diffusion models (DDMs) using reinforcement learning, specifically when you want the model to generate diverse and high-quality outputs based on rewards.

What's the problem?

Training DDMs with rewards is difficult because they don't fit the usual machinery of reinforcement learning. Normally, the learner needs to know how likely the model was to produce each step of its output so it can weight the reward accordingly (importance sampling), but DDMs denoise many tokens in parallel rather than generating one token at a time. This makes it hard to attribute the reward to specific parts of the generated output, and it makes training unstable and inefficient. Existing methods like Group Relative Policy Optimization (GRPO) struggle with this.
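To make the gap concrete, here is a minimal sketch of the group-relative advantage computation at the heart of GRPO (the part that still works for DDMs), with a comment marking the per-token importance ratio that an autoregressive model supplies but a one-shot diffusion model does not. The function name and numbers are illustrative, not taken from the paper.

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one group of rollouts."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# A group of 4 sampled completions scored by some reward function.
advantages = group_relative_advantages([1.0, 0.0, 0.5, 0.5])

# An autoregressive policy also yields per-token log-probabilities, so the
# importance ratio pi_new(y_t | y_<t) / pi_old(y_t | y_<t) needed to weight
# these advantages is directly available. A discrete diffusion model unmasks
# many tokens in parallel, so no such per-token factorization exists -- this
# intractable ratio is the gap the paper targets.
```

The advantages always standardize to zero mean and unit scale within the group; the hard part for DDMs is the ratio in the comment, not this computation.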

What's the solution?

The researchers introduce a new method called MaskGRPO. It starts by clarifying the math behind how DDMs function, which lets them build a better estimator of how much each part of the generated output contributed to the reward. They then improve how the model explores different possibilities during training, so that it produces a variety of outputs and receives reliable feedback for learning. Together, these changes make the reinforcement learning process more stable and efficient for this class of models.
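The idea of a token-level importance estimator can be sketched as follows. This is a hypothetical illustration, not the paper's exact formulation: we assume the denoiser exposes, for each position it unmasked, the probability it assigned to the chosen token under the old and new policies, and we form a clipped, GRPO-style per-token objective from those probabilities.

```python
def masked_grpo_loss(p_new, p_old, unmasked, advantage, clip=0.2):
    """Hypothetical masked, token-level GRPO-style loss.

    p_new / p_old: per-token probabilities the new/old denoiser assigned to
    the sampled tokens; unmasked: which positions the model actually predicted.
    Names and the clipping scheme are illustrative assumptions.
    """
    loss, n = 0.0, 0
    for pn, po, m in zip(p_new, p_old, unmasked):
        if not m:                     # skip positions given as context
            continue
        ratio = pn / po               # per-token importance estimate
        clipped = max(min(ratio, 1 + clip), 1 - clip)
        # PPO/GRPO-style pessimistic objective, negated for minimization
        loss -= min(ratio * advantage, clipped * advantage)
        n += 1
    return loss / max(n, 1)

# Only the two unmasked positions contribute to the update.
loss = masked_grpo_loss(p_new=[0.6, 0.5, 0.9],
                        p_old=[0.5, 0.5, 0.3],
                        unmasked=[True, True, False],
                        advantage=1.0)
```

Restricting the estimate to tokens the model actually predicted is one plausible way to capture the "token fluctuation" the abstract mentions; the paper's actual estimator may differ in form.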

Why it matters?

This work is important because it provides a practical way to train DDMs using reinforcement learning, opening the door to better performance in tasks like math reasoning, coding, and generating images. It's the first method that can reliably and efficiently optimize these models, which could lead to significant improvements in the quality and capabilities of AI systems that generate complex outputs.

Abstract

Optimizing discrete diffusion models (DDMs) with rewards remains a challenge: the non-autoregressive paradigm makes importance sampling intractable and rollout complex, confounding reinforcement learning methods such as Group Relative Policy Optimization (GRPO). In this study, we introduce MaskGRPO, the first viable approach to enable scalable multimodal reinforcement learning in discrete diffusion with effective importance sampling and modality-specific adaptations. To this end, we first clarify the theoretical foundation of DDMs, which facilitates building an importance estimator that captures valuable token fluctuation for gradient updates. We then carefully tailor the rollout method for visual sequences, which yields diverse completions and reliable optimization gradients. On math reasoning, coding, and visual generation benchmarks, MaskGRPO brings more stable and efficient updates, leading to stronger reasoning performance and better generation quality. This study establishes MaskGRPO as a systematic policy optimization approach and the first practical one for discretized visual diffusion.