
LLaDA 1.5: Variance-Reduced Preference Optimization for Large Language Diffusion Models

Fengqi Zhu, Rongzhen Wang, Shen Nie, Xiaolu Zhang, Chunwei Wu, Jun Hu, Jun Zhou, Jianfei Chen, Yankai Lin, Ji-Rong Wen, Chongxuan Li

2025-05-27

Summary

This paper introduces LLaDA 1.5, a large language diffusion model trained with a new method so that its outputs better match what people actually want. The method, called variance-reduced preference optimization (VRPO), stabilizes preference training and helps the model perform better across a range of benchmarks by bringing its responses more in line with human expectations.

What's the problem?

The problem is that even advanced language models often give answers or generate content that doesn't fit what people like or expect. Aligning a model with human preferences is especially tricky for diffusion language models: unlike standard autoregressive models, they can't compute the exact likelihood of a response, so preference training has to rely on noisy Monte Carlo estimates of a bound (the ELBO). That sampling noise feeds directly into the training signal, which makes existing preference-training methods unstable and unreliable.
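To make the instability concrete, here is a minimal sketch of how a masked diffusion model's log-likelihood is typically estimated: sample a masking level, mask tokens at random, and reweight the log-probabilities of the masked tokens. The `model(x, y, mask)` interface is a hypothetical placeholder, not the paper's API; the point is that every estimate is noisy.

```python
import torch

def elbo_log_likelihood(model, x, y, n_samples=8):
    """Monte Carlo ELBO estimate of log p(y | x) for a masked diffusion model.

    `model(x, y, mask)` is an assumed interface returning per-token
    log-probabilities of the true tokens given the partially masked sequence.
    Each sample draws a fresh masking level t and a fresh mask, so the
    estimate fluctuates from sample to sample -- this is the variance
    that destabilizes preference training.
    """
    estimates = []
    for _ in range(n_samples):
        t = torch.rand(1).clamp(min=1e-3)      # masking level t ~ U(0, 1]
        mask = torch.rand(y.shape) < t         # mask each token with prob. t
        log_probs = model(x, y, mask)          # log p(y_i | masked context)
        # One ELBO sample: reweighted sum of log-probs over masked positions.
        estimates.append((log_probs * mask).sum() / t)
    return torch.stack(estimates).mean()
```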

What's the solution?

The authors developed VRPO, a special training approach that reduces randomness and errors during the learning process, making it easier for the model to learn what kinds of responses people prefer. By using VRPO with masked diffusion models, the system becomes more stable and produces answers that are more likely to satisfy users.
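Below is a hedged sketch of that coupling idea, reusing the hypothetical `model(x, y, mask)` interface from the previous snippet. It is not the authors' implementation; it only illustrates sampling the shared randomness once and reusing it across the four ELBO terms that enter a DPO-style preference score.

```python
import torch

def vrpo_preference_score(policy, ref, x, y_win, y_lose, n_samples=8):
    """Sketch of a VRPO-style variance-reduced preference score.

    The coupling trick: reuse one sampled timestep (and, per response,
    one mask) across all four ELBO terms -- policy vs. reference,
    preferred vs. dispreferred -- so shared sampling noise cancels
    when the terms are subtracted. Interfaces are assumptions.
    """
    scores = []
    for _ in range(n_samples):
        t = torch.rand(1).clamp(min=1e-3)     # one timestep shared by all terms
        mask_w = torch.rand(y_win.shape) < t  # one mask per response, reused
        mask_l = torch.rand(y_lose.shape) < t #   by both policy and reference

        def elbo(model, y, mask):             # same estimator as above
            return (model(x, y, mask) * mask).sum() / t

        # Paired differences over shared (t, mask) have much lower variance
        # than differences of independently sampled estimates.
        scores.append(
            (elbo(policy, y_win, mask_w) - elbo(ref, y_win, mask_w))
            - (elbo(policy, y_lose, mask_l) - elbo(ref, y_lose, mask_l))
        )
    return torch.stack(scores).mean()
```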

Why it matters?

This is important because it means AI models can become more useful and trustworthy in real-world situations, like answering questions, writing stories, or helping with tasks, since their outputs will better reflect what people actually want.

Abstract

VRPO is a variance-reduced preference optimization framework for Masked Diffusion Models that significantly enhances their alignment with human preferences and performance across various benchmarks.