ΔL Normalization: Rethink Loss Aggregation in RLVR

Zhiyuan He, Xufang Luo, Yike Zhang, Yuqing Yang, Lili Qiu

2025-09-10

Summary

This paper introduces a technique called Delta L Normalization that improves how large language models are trained with Reinforcement Learning with Verifiable Rewards (RLVR). The goal is to make that training process more stable and effective.

What's the problem?

When training language models with RLVR, the length of the responses they generate can vary a lot. This variability is a problem because it inflates the variance of the gradient updates, making optimization unstable and learning inconsistent. Previous attempts to fix this either introduced bias into the loss estimate or still suffered from high gradient variance.

What's the solution?

The researchers reframed the problem as a statistics question: how should the policy loss be aggregated across responses of different lengths so that the estimate is unbiased (correct on average) while keeping gradient variance as low as possible? Delta L Normalization is a loss aggregation scheme that satisfies both properties at once. The authors prove this mathematically and confirm it in experiments.
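The phrase "minimum-variance unbiased estimator" has a concrete classical meaning that helps here: given several independent unbiased estimates of the same quantity, weighting each by the inverse of its variance yields the lowest-variance unbiased combination. A minimal sketch of that general statistics result (this is not the paper's exact ΔL formula, which derives its weights from how gradient variance depends on response length):

```python
import numpy as np

def min_variance_combine(estimates, variances):
    """Classical inverse-variance weighting.

    Each estimate is assumed to be an independent, unbiased estimate of
    the same quantity; weighting by 1/variance (then normalizing the
    weights to sum to 1) gives the minimum-variance unbiased combination.
    """
    w = 1.0 / np.asarray(variances, dtype=float)
    w /= w.sum()  # normalized weights keep the combination unbiased
    return float(np.dot(w, np.asarray(estimates, dtype=float)))
```

Noisier estimates get smaller weights, so the combined estimate stays correct on average while its fluctuation shrinks; ΔL Normalization applies this principle to per-response losses whose noise depends on their length.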

Why it matters?

This is important because it makes RLVR a more reliable way to improve the reasoning abilities of large language models. A more stable training process means we can build better and more capable AI systems, and this technique works well regardless of the model's size or the specific task it's trying to accomplish.

Abstract

We propose Delta L Normalization, a simple yet effective loss aggregation method tailored to the characteristic of dynamic generation lengths in Reinforcement Learning with Verifiable Rewards (RLVR). Recently, RLVR has demonstrated strong potential in improving the reasoning capabilities of large language models (LLMs), but a major challenge lies in the large variability of response lengths during training, which leads to high gradient variance and unstable optimization. Although previous methods such as GRPO, DAPO, and Dr. GRPO introduce different loss normalization terms to address this issue, they either produce biased estimates or still suffer from high gradient variance. By analyzing the effect of varying lengths on policy loss both theoretically and empirically, we reformulate the problem as finding a minimum-variance unbiased estimator. Our proposed Delta L Normalization not only provides an unbiased estimate of the true policy loss but also minimizes gradient variance in theory. Extensive experiments show that it consistently achieves superior results across different model sizes, maximum lengths, and tasks. Our code will be made public at https://github.com/zerolllin/Delta-L-Normalization.
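For context, the baselines named in the abstract differ mainly in how they normalize the summed per-token losses across a group of responses. A simplified sketch of the three aggregation rules as commonly described (the real objectives also include clipping and advantage terms, omitted here; this is an illustration, not the paper's code):

```python
import numpy as np

def grpo_agg(token_losses):
    # GRPO: average token losses within each response, then average the
    # per-response means -- short responses get the same weight as long ones.
    return float(np.mean([np.mean(t) for t in token_losses]))

def dapo_agg(token_losses):
    # DAPO: pool every token across all responses into one global average,
    # so longer responses contribute proportionally more tokens.
    flat = np.concatenate([np.asarray(t, dtype=float) for t in token_losses])
    return float(np.mean(flat))

def dr_grpo_agg(token_losses, max_len):
    # Dr. GRPO: sum token losses per response and divide by a fixed
    # maximum length instead of each response's own length.
    return float(np.mean([np.sum(t) / max_len for t in token_losses]))
```

Because each rule trades off bias against variance differently when lengths vary, none is simultaneously unbiased and minimum-variance, which is the gap ΔL Normalization targets.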