Shorter but not Worse: Frugal Reasoning via Easy Samples as Length Regularizers in Math RLVR
Abdelaziz Bounhar, Hadi Abdine, Evan Dufraisse, Ahmad Chamma, Amr Mohamed, Dani Bouch, Michalis Vazirgiannis, Guokan Shang
2025-11-05
Summary
This paper investigates why large language models, when trained to think step-by-step, tend to give overly long and wordy answers, and proposes a way to fix this without directly telling the model to be shorter.
What's the problem?
When you train AI models to solve complex problems by reasoning through steps, they often learn to write extremely long explanations, even for simple tasks. This happens because training focuses on the *hard* problems where longer reasoning is actually needed. The model then mistakenly believes that *all* good solutions require lengthy explanations, which drives up the cost of using it, since it generates so much text. Existing training methods often remove easier problems to speed up training, which makes this problem even worse.
What's the solution?
The researchers found that keeping a moderate share of easier problems in the training mix helps the model learn to be concise. By seeing problems that can be solved with short explanations, the model learns to avoid unnecessary verbosity. They never explicitly told the model to be shorter; it learned brevity on its own, an effect they call 'emergent brevity'. Tested on a specific model (Qwen3-4B-Thinking-2507), this approach solved problems just as well while producing answers that were, on average, nearly half as long.
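The data-mixing idea can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the thresholds, weights, and the pass-rate difficulty measure are all assumptions chosen for the example. The key contrast is that standard RLVR filtering would drop every problem above the "easy" threshold, while here moderately easy problems are kept and up-weighted.

```python
import random

def build_mixture(problems, easy_threshold=0.7, easy_weight=2.0, drop_trivial=0.95):
    """Assign sampling weights by difficulty (illustrative values only).

    problems: list of (id, pass_rate) pairs, where pass_rate is the
    fraction of sampled solutions a verifier accepts (higher = easier).
    """
    weights = {}
    for pid, pass_rate in problems:
        if pass_rate >= drop_trivial:
            continue                       # near-saturated: no learning signal, drop
        elif pass_rate >= easy_threshold:
            weights[pid] = easy_weight     # moderately easy: keep and up-weight
        else:
            weights[pid] = 1.0             # hard: default weight
    return weights

def sample_batch(weights, batch_size, rng=None):
    """Draw a training batch with probability proportional to the weights."""
    rng = rng or random.Random(0)
    ids = list(weights)
    return rng.choices(ids, weights=[weights[i] for i in ids], k=batch_size)
```

Under this scheme, short-chain (easy) problems appear in every batch often enough to anchor the model's output-length distribution, without any explicit length penalty in the reward.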
Why it matters?
This research is important because it offers a way to make large language models more efficient and cost-effective. By reducing the length of their responses without sacrificing accuracy, we can lower the computational resources needed to run them, making them more accessible and practical for real-world applications. It shows that carefully choosing the training data can have a significant impact on model behavior, even without directly modifying the model's instructions.
Abstract
Large language models (LLMs) trained for step-by-step reasoning often become excessively verbose, raising inference cost. Standard Reinforcement Learning with Verifiable Rewards (RLVR) pipelines filter out "easy" problems for training efficiency, leaving the model to train primarily on harder problems that require longer reasoning chains. This skews the output length distribution upward, resulting in a model that conflates "thinking longer" with "thinking better". In this work, we show that retaining and modestly up-weighting moderately easy problems acts as an implicit length regularizer. Exposing the model to solvable short-chain tasks constrains its output distribution and prevents runaway verbosity. The result is *emergent brevity for free*: the model learns to solve harder problems without inflating the output length, despite the absence of any explicit length penalization. RLVR experiments using this approach on Qwen3-4B-Thinking-2507 (with a 16k token limit) achieve baseline pass@1 AIME25 accuracy while generating solutions that are, on average, nearly twice as short. The code is available on GitHub at https://github.com/MBZUAI-Paris/Frugal-AI, with datasets and models on Hugging Face at https://huggingface.co/collections/MBZUAI-Paris/k2-think-mini-68dcfa8b114686a4bd3dc2bc.