DLER: Doing Length pEnalty Right - Incentivizing More Intelligence per Token via Reinforcement Learning

Shih-Yang Liu, Xin Dong, Ximing Lu, Shizhe Diao, Mingjie Liu, Min-Hung Chen, Hongxu Yin, Yu-Chiang Frank Wang, Kwang-Ting Cheng, Yejin Choi, Jan Kautz, Pavlo Molchanov

2025-10-20

Summary

This paper focuses on making reasoning language models, such as OpenAI-o1 and DeepSeek-R1, more efficient. These models perform well on complex reasoning tasks, but they often produce unnecessarily long answers. The goal is to reach the same level of accuracy with far fewer tokens.

What's the problem?

The main issue is that simply shortening responses with a basic method, such as truncating them at a fixed length, makes the models less accurate. Researchers assumed the length penalty itself wasn't sophisticated enough, but this paper shows the real problem is that the reinforcement learning used to train models to be concise isn't optimized well. Specifically, the training process estimates the value (advantage) of different answers with large bias, collapses into very predictable and therefore less exploratory responses (entropy collapse), and receives too little useful feedback because the reward signal is sparse.
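To see why a plain truncation penalty makes the reward sparse, consider a minimal sketch of such a reward function. This is illustrative, not the paper's exact implementation; the 1,024-token budget and the `is_correct` checker are hypothetical placeholders:

```python
def truncation_reward(response_tokens, answer, is_correct, budget=1024):
    """Reward under a hard truncation penalty: any response longer
    than the token budget scores 0 regardless of its quality."""
    if len(response_tokens) > budget:
        return 0.0  # truncated -> zero reward, no learning signal
    return 1.0 if is_correct(response_tokens, answer) else 0.0
```

If most sampled responses exceed the budget, almost every reward in a batch is zero, so the policy gets almost no gradient signal; this is the sparsity problem DLER's training recipe is designed to counteract.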

What's the solution?

The researchers developed a new training recipe called Doing Length pEnalty Right, or DLER. It combines several improvements to the RL training process: normalizing rewards across each batch, using a higher clipping threshold so the model can update its policy more freely, dynamically sampling which examples to train on, and applying a simple truncation length penalty. They also created Difficulty-Aware DLER, a variant that tightens the truncation budget on easier questions for additional efficiency gains. Finally, they propose an update-selective merging method that transfers these concise reasoning skills to existing models without retraining them from scratch, which is useful when RL training data is scarce.
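The paper does not spell out the normalization formula here, but batch-wise reward normalization is commonly the standard mean/std rescaling over all sampled responses in a batch. A minimal sketch under that assumption:

```python
import statistics

def normalized_advantages(rewards):
    """Batch-wise reward normalization: center and scale rewards
    across the whole sampled batch, so advantage estimates are not
    dominated by a flood of zero-reward (truncated) responses."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero variance
    return [(r - mean) / std for r in rewards]
```

With rewards like `[1.0, 0.0, 1.0, 0.0]`, this yields advantages that sum to zero, so correct responses are pushed up and truncated or wrong ones are pushed down relative to the batch rather than against a biased global baseline.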

Why it matters?

This work is important because it significantly improves the efficiency of large language models. DLER can cut the length of responses by over 70% while maintaining or even improving accuracy. This means faster responses, lower costs, and the ability to process more information with the same computing power. It also makes these models more practical for real-world applications where concise answers are crucial.

Abstract

Reasoning language models such as OpenAI-o1, DeepSeek-R1, and Qwen achieve strong performance via extended chains of thought but often generate unnecessarily long outputs. Maximizing intelligence per token--accuracy relative to response length--remains an open problem. We revisit reinforcement learning (RL) with the simplest length penalty--truncation--and show that accuracy degradation arises not from the lack of sophisticated penalties but from inadequate RL optimization. We identify three key challenges: (i) large bias in advantage estimation, (ii) entropy collapse, and (iii) sparse reward signal. We address them with Doing Length pEnalty Right (DLER), a training recipe combining batch-wise reward normalization, higher clipping, dynamic sampling, and a simple truncation length penalty. DLER achieves state-of-the-art accuracy--efficiency trade-offs, cutting output length by over 70 percent while surpassing all previous baseline accuracy. It also improves test-time scaling: compared to DeepSeek-R1-7B, DLER-7B generates multiple concise responses in parallel with 28 percent higher accuracy and lower latency. We further introduce Difficulty-Aware DLER, which adaptively tightens truncation on easier questions for additional efficiency gains. We also propose an update-selective merging method that preserves baseline accuracy while retaining the concise reasoning ability of the DLER model, which is useful for scenarios where RL training data is scarce.