Inpainting-Guided Policy Optimization for Diffusion Large Language Models
Siyan Zhao, Mengchen Liu, Jing Huang, Miao Liu, Chenyu Wang, Bo Liu, Yuandong Tian, Guan Pang, Sean Bell, Aditya Grover, Feiyu Chen
2025-09-15
Summary
This paper explores a new way to train powerful language models, specifically those built using a 'masked diffusion' approach, to solve complex problems like math questions. It focuses on making the training process more efficient and effective by cleverly guiding the model's learning.
What's the problem?
Training large language models with reinforcement learning is hard because the reward signal is sparse: on difficult problems the model may never stumble on a correct solution, so every sampled attempt earns the same (zero) reward and provides no useful feedback. It's like trying to learn something new with no guidance, wasting time going down wrong paths. Traditional methods struggle with this 'exploration' problem, leading to slow learning and many wasted samples.
What's the solution?
The researchers developed a technique called IGPO, which stands for Inpainting-Guided Policy Optimization. Imagine the model trying to solve a math problem step by step. Instead of letting it explore completely on its own, IGPO strategically 'inpaints' some of the correct reasoning steps into the model's rollouts during training. It doesn't give away the whole answer, but provides partial hints that steer the model toward promising solution paths while the rest of the reasoning is still self-generated. This helps the model learn more efficiently and avoid getting stuck with no reward signal. Before reinforcement learning, they also improved supervised fine-tuning by synthetically rewriting training traces to be more concise and better matched to how the diffusion model generates text. Finally, they used an entropy-based filtering step to focus learning on the more promising samples.
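To make the inpainting idea concrete, the sketch below builds the partially revealed response that a guided rollout would start from: a fraction of ground-truth reasoning tokens are kept as fixed hints and the rest are masked for the diffusion model to fill in itself. This is only an illustration of the idea described above, not the authors' implementation; the token-level hinting granularity, the `hint_ratio` parameter, and the `<mask>` placeholder are all assumptions.

```python
import random

MASK = "<mask>"  # placeholder mask token; a real dLLM uses its own special mask id


def build_inpainting_template(gt_trace_tokens, hint_ratio=0.3, seed=None):
    """Illustrative sketch: reveal a random fraction of ground-truth reasoning
    tokens as fixed hints and mask the rest, so the dLLM can inpaint the
    remaining positions during an online rollout. Not the authors' code."""
    rng = random.Random(seed)
    n = len(gt_trace_tokens)
    num_hints = int(hint_ratio * n)
    hint_positions = set(rng.sample(range(n), num_hints))
    template = [tok if i in hint_positions else MASK
                for i, tok in enumerate(gt_trace_tokens)]
    # The dLLM would denoise the masked slots while keeping hint positions frozen.
    return template, hint_positions


# Example: reveal roughly 30% of a short ground-truth trace as hints.
trace = "first compute 7 * 8 = 56 then add 4 to get 60".split()
template, hints = build_inpainting_template(trace, hint_ratio=0.3, seed=0)
print(template)
```

The key design point is that only a partial trace is revealed, so the model still has to produce most of the reasoning itself; this keeps the training signal on self-generated text while steering exploration toward trajectories that can actually earn reward.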
Why it matters?
This work is important because it substantially improves the performance of masked diffusion language models on challenging mathematical benchmarks, setting new state-of-the-art results among full-attention masked dLLMs. By making reinforcement learning more sample-efficient, it paves the way for more capable AI systems that can tackle complex reasoning problems in various fields. The inpainting technique also offers a novel way to guide exploration in language models, which could have broader applications beyond math.
Abstract
Masked diffusion large language models (dLLMs) are emerging as promising alternatives to autoregressive LLMs, offering competitive performance while supporting unique generation capabilities such as inpainting. We explore how inpainting can inform RL algorithm design for dLLMs. Aligning LLMs with reinforcement learning faces an exploration challenge: sparse reward signals and sample waste when models fail to discover correct solutions. While this inefficiency affects LLMs broadly, dLLMs offer a distinctive opportunity--their inpainting ability can guide exploration. We introduce IGPO (Inpainting Guided Policy Optimization), an RL framework that strategically inserts partial ground-truth reasoning traces during online sampling. Unlike providing full solutions, inpainting steers exploration toward promising trajectory spaces while preserving self-generated reasoning, bridging supervised fine-tuning and reinforcement learning. We apply IGPO to group-based optimization methods such as GRPO, where exploration failures cause zero advantages and gradients. IGPO restores meaningful gradients while improving sample efficiency. We also propose supervised fine-tuning on synthetically rewritten concise traces that better align with dLLM generation patterns. With additional techniques including entropy-based filtering, our training recipe yields substantial gains across three mathematical benchmarks--GSM8K, Math500, and AMC--achieving new state-of-the-art results for full-attention masked dLLMs.
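The abstract's point about zero advantages can be seen with a small numeric sketch. In group-based methods such as GRPO, each rollout's advantage is its reward standardized against the other rollouts sampled for the same prompt (the normalization below follows the commonly used GRPO form; how guided rollouts are mixed into the group is only indicated schematically and is an assumption here).

```python
import numpy as np


def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style advantage: each rollout's reward standardized within its group."""
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)


# If every rollout in a group fails (or every one succeeds), rewards are
# identical, all advantages are zero, and this prompt contributes no gradient.
print(group_relative_advantages([0.0, 0.0, 0.0, 0.0]))   # -> [0. 0. 0. 0.]

# Replacing some rollouts with inpainting-guided ones that do reach the answer
# makes rewards differ within the group, restoring a non-zero learning signal.
print(group_relative_advantages([0.0, 0.0, 1.0, 1.0]))   # -> roughly [-1, -1, 1, 1]
```

This is the failure mode IGPO targets: on hard prompts where unguided sampling never finds a correct answer, the group's rewards are uniform and the gradient vanishes, whereas inpainting partial ground-truth traces reintroduces reward variance within the group.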