
Boundary-Guided Policy Optimization for Memory-efficient RL of Diffusion Large Language Models

Nianyi Lin, Jiajie Zhang, Lei Hou, Juanzi Li

2025-10-15


Summary

This paper tackles the challenge of using reinforcement learning to improve large language models built on a 'diffusion' process, which are capable text generators. The main issue is that the likelihoods these models assign to their outputs, which reinforcement learning needs in order to optimize them, cannot be computed exactly, and approximating them during training consumes a great deal of computer memory.

What's the problem?

When you try to teach these diffusion language models with reinforcement learning, you need to know how likely the model is to produce a given text output. That likelihood cannot be computed exactly, so existing methods estimate it with Monte Carlo sampling. The catch is that they must keep the intermediate calculations (the forward computation graphs) for every sample until the gradient is computed, which quickly uses up available memory. This caps how many samples can be used, so the likelihood estimate is imprecise and the training signal the model learns from gets distorted.
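
To make the memory problem concrete, here is a minimal PyTorch-style sketch (not the authors' code; `model.elbo_log_prob`, `old_logp`, and `advantage` are hypothetical names). Because the loss applies a non-linear function, such as the exponentiated importance ratio, to the average of the sample estimates, every sample's forward graph has to stay in memory until the single backward pass:

```python
import torch

# Hypothetical illustration of the memory bottleneck in ELBO-based RL for dLLMs.
# `model.elbo_log_prob` is an assumed helper returning one Monte Carlo estimate
# of the log-likelihood (ELBO) of `response` given `prompt`.
def naive_elbo_loss(model, prompt, response, old_logp, advantage, num_samples=8):
    per_sample_logps = []
    for _ in range(num_samples):
        # Each call builds a forward graph that cannot be freed yet,
        # so memory grows linearly with num_samples.
        per_sample_logps.append(model.elbo_log_prob(prompt, response))
    logp = torch.stack(per_sample_logps).mean()  # MC estimate of the ELBO
    ratio = torch.exp(logp - old_logp)           # non-linear in the MC average
    loss = -(ratio * advantage)
    loss.backward()                              # only now can the graphs be released
    return loss.item()
```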

What's the solution?

The researchers developed a new method called Boundary-Guided Policy Optimization (BGPO). Instead of optimizing the hard-to-handle objective directly, BGPO maximizes a carefully constructed lower bound that is written as a sum of terms, each depending on only a single Monte Carlo sample. Because the terms are independent, their gradients can be computed and accumulated one sample at a time, so memory use stays constant no matter how many samples are drawn. Importantly, in on-policy training this lower bound has the same value and gradient as the original, more memory-intensive objective, so BGPO delivers the same learning signal without the memory bottleneck (see the sketch below).
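
As an illustrative sketch of the memory pattern this linearity enables (again not the authors' code; `sample_weight` stands in for whatever fixed per-sample coefficient the bound prescribes, and `model.elbo_log_prob` is the same hypothetical helper as above), each term can be backpropagated and its graph discarded immediately:

```python
import torch

# Sketch of per-sample gradient accumulation over a linear-sum objective:
# each term depends on a single MC sample, so its graph is freed right after
# backward(), keeping memory constant however large num_samples becomes.
def linear_sum_loss(model, prompt, response, sample_weight, num_samples=64):
    total_loss = 0.0
    for _ in range(num_samples):
        term = sample_weight * model.elbo_log_prob(prompt, response) / num_samples
        loss_i = -term
        loss_i.backward()            # gradients accumulate in each parameter's .grad
        total_loss += loss_i.item()  # the graph for this sample is now released
    return total_loss                # the optimizer step happens outside this sketch
```

Because the per-sample gradients simply add up, a much larger sample size buys a more accurate likelihood estimate without any extra memory, which is the trade-off BGPO exploits.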

Why it matters?

This work is important because it makes reinforcement learning for diffusion language models both more accurate and more practical. By removing the memory bottleneck, BGPO can use many more Monte Carlo samples when estimating likelihoods, which gives a better estimate of the training objective and leads to stronger performance on complex tasks such as solving math problems, generating code, and planning. This means we can build even more powerful and capable diffusion-based language models.

Abstract

A key challenge in applying reinforcement learning (RL) to diffusion large language models (dLLMs) lies in the intractability of their likelihood functions, which are essential for the RL objective, necessitating corresponding approximation in each training step. While existing methods approximate the log-likelihoods by their evidence lower bounds (ELBOs) via customized Monte Carlo (MC) sampling, the forward computational graphs of all MC samples need to be retained for the gradient computation of non-linear terms in the RL objective, resulting in significant memory overhead. This constraint restricts feasible sample sizes, leading to imprecise likelihood approximations and ultimately distorting the RL objective. To overcome this limitation, we propose Boundary-Guided Policy Optimization (BGPO), a memory-efficient RL algorithm that maximizes a specially constructed lower bound of the ELBO-based objective. This lower bound is carefully designed to satisfy two key properties: (1) Linearity: it is formulated in a linear sum where each term depends only on a single MC sample, thereby enabling gradient accumulation across samples and ensuring constant memory usage; (2) Equivalence: Both the value and gradient of this lower bound are equal to those of the ELBO-based objective in on-policy training, making it also an effective approximation for the original RL objective. These properties allow BGPO to adopt a large MC sample size, resulting in more accurate likelihood approximations and improved RL objective estimation, which in turn leads to enhanced performance. Experiments show that BGPO significantly outperforms previous RL algorithms for dLLMs in math problem solving, code generation, and planning tasks.
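
In generic notation of our own (a sketch of the idea, not the paper's exact construction), the contrast is between a non-linear function of the Monte Carlo average and a linear sum over samples:

$$
J_{\text{ELBO}}(\theta) \;=\; f\!\Big(\tfrac{1}{n}\sum_{i=1}^{n}\ell_\theta(x_i)\Big)
\qquad\text{vs.}\qquad
J_{\text{lin}}(\theta) \;=\; \tfrac{1}{n}\sum_{i=1}^{n} g_\theta(x_i),
$$

where $f$ is the non-linear part of the RL objective (e.g., the importance ratio), $\ell_\theta(x_i)$ is the ELBO term from MC sample $x_i$, and each $g_\theta(x_i)$ depends on a single sample. The gradient of the linear form decomposes as $\nabla_\theta J_{\text{lin}} = \tfrac{1}{n}\sum_i \nabla_\theta g_\theta(x_i)$, so it can be accumulated sample by sample in constant memory, while BGPO's bound is built so that its value and gradient match the ELBO-based objective in on-policy training.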