DiRL: An Efficient Post-Training Framework for Diffusion Language Models

Ying Zhu, Jiaxin Wan, Xiaoran Liu, Siyang He, Qiqi Wang, Xu Guo, Tianyi Liang, Zengfeng Huang, Ziwei He, Xipeng Qiu

2025-12-30


Summary

This paper introduces a new way to improve diffusion language models, which are a newer type of AI model for understanding and generating text, specifically focusing on making them better at complex tasks like solving math problems.

What's the problem?

While diffusion language models are fast at generating text once they're trained, getting them to perform well on tasks requiring reasoning, like math, is a challenge. Current methods for improving these models after their initial training are computationally expensive, and the objective they optimize during training doesn't match how the model actually generates text at inference time. This mismatch leads to poor performance on complex reasoning problems.

What's the solution?

The researchers developed a system called DiRL, which stands for Diffusion Reinforcement Learning. It efficiently updates the model after its initial training using a two-stage process: first, supervised fine-tuning on examples, and then reinforcement learning to further improve performance. A key part of DiRL is pairing accelerated blockwise training with optimized inference, so the loop of generating responses and updating the model runs fast enough to be practical. They also created DiPO, a new reinforcement learning method that adapts Group Relative Policy Optimization (GRPO) to diffusion language models without introducing bias.
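To make the GRPO connection concrete, here is a minimal sketch of the group-relative advantage computation at the core of GRPO-style methods: several completions are sampled for the same prompt, and each one's reward is normalized against its own group's statistics. This is a generic illustration of the GRPO idea, not the paper's DiPO implementation, which adapts it to the diffusion setting.

```python
import statistics

def grpo_advantages(rewards):
    """Group-relative advantages: normalize each sampled completion's
    reward by the mean and standard deviation of its sampling group.
    Illustrative sketch only; DiPO's diffusion-specific details differ."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against all-equal rewards
    return [(r - mean) / std for r in rewards]

# Example: four completions sampled for one math prompt, scored 1.0
# for a correct final answer and 0.0 otherwise.
advantages = grpo_advantages([1.0, 0.0, 0.0, 1.0])
```

Completions that beat their group's average get positive advantages and are reinforced; below-average ones are pushed down, with no separate value network required.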

Why it matters?

This work is important because it shows how to effectively improve diffusion language models, making them competitive with, and even better than, other types of models on challenging tasks like math. This advancement could lead to more powerful AI systems capable of tackling complex problems in fields like science, engineering, and education.

Abstract

Diffusion Language Models (dLLMs) have emerged as promising alternatives to Auto-Regressive (AR) models. While recent efforts have validated their pre-training potential and accelerated inference speeds, the post-training landscape for dLLMs remains underdeveloped. Existing methods suffer from computational inefficiency and objective mismatches between training and inference, severely limiting performance on complex reasoning tasks such as mathematics. To address this, we introduce DiRL, an efficient post-training framework that tightly integrates FlexAttention-accelerated blockwise training with LMDeploy-optimized inference. This architecture enables a streamlined online model update loop, facilitating efficient two-stage post-training (Supervised Fine-Tuning followed by Reinforcement Learning). Building on this framework, we propose DiPO, the first unbiased Group Relative Policy Optimization (GRPO) implementation tailored for dLLMs. We validate our approach by training DiRL-8B-Instruct on high-quality math data. Our model achieves state-of-the-art math performance among dLLMs and surpasses comparable models in the Qwen2.5 series on several benchmarks.