
Data-Efficient RLVR via Off-Policy Influence Guidance

Erle Zhu, Dazhi Jiang, Yuan Wang, Xujun Li, Jiale Cheng, Yuxian Gu, Yilin Niu, Aohan Zeng, Jie Tang, Minlie Huang, Hongning Wang

2025-11-04


Summary

This paper focuses on making Reinforcement Learning with Verifiable Rewards (RLVR) more efficient for large language models (LLMs). It's about figuring out which training examples are *most* helpful for the LLM to learn from, so you don't waste time on examples that don't contribute much.

What's the problem?

Currently, choosing which data to use for training LLMs in RLVR is done using guesswork or simple rules of thumb. These methods aren't based on solid mathematical principles and don't always work well across different situations. Also, figuring out how much each piece of data helps the LLM learn requires a lot of computation, especially when dealing with huge models, making it slow and expensive.

What's the solution?

The researchers came up with a new method that uses something called 'influence functions' to estimate how much each training example affects the LLM's learning. To make this practical, they estimate influence *off-policy*: instead of repeatedly running the LLM to generate fresh rollouts during estimation, they reuse pre-collected offline trajectories, which saves a lot of compute. They also used a technique called 'sparse random projection' to shrink the model's enormous gradient vectors down to a manageable size, making the calculations faster and far less memory-hungry. This all comes together in a system called CROPI (Curriculum RL with Off-Policy Influence guidance), which picks the most impactful data for the current policy, stage by stage, during training.
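To make the two key ideas concrete, here is a minimal sketch of influence-style data scoring with sparse random projection. All the numbers, the random "gradients", and the top-k selection rule are stand-ins for illustration; the paper's actual estimator and projection details may differ.

```python
import numpy as np

rng = np.random.default_rng(0)

d = 10_000   # full gradient dimension (billions for a real LLM)
k = 64       # projected dimension
n = 500      # number of candidate training examples

# Sparse random projection: entries in {-sqrt(s), 0, +sqrt(s)} with
# sparsity s, scaled so inner products are preserved in expectation.
s = 3
P = rng.choice([np.sqrt(s), 0.0, -np.sqrt(s)],
               size=(d, k),
               p=[1 / (2 * s), 1 - 1 / s, 1 / (2 * s)]) / np.sqrt(k)

# Stand-ins for per-example gradients and the learning-objective gradient.
example_grads = rng.normal(size=(n, d))
target_grad = rng.normal(size=d)

# Project once, then score each example by the inner product of its
# projected gradient with the projected target gradient -- a first-order
# influence approximation computed entirely in the small k-dim space.
proj_examples = example_grads @ P   # shape (n, k)
proj_target = target_grad @ P       # shape (k,)
scores = proj_examples @ proj_target

# Keep the top 10% most influential examples for the next training stage.
budget = n // 10
selected = np.argsort(scores)[::-1][:budget]
```

The point of the projection is that the `(n, k)` matrix of projected gradients is cheap to store and score against, whereas the raw `(n, d)` gradients would not fit in memory for a large model.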

Why it matters?

This work is important because it shows a way to significantly speed up RLVR training of large language models. By intelligently selecting the most useful data, the authors report a 2.66x step-level acceleration on a 1.5B model while using only 10% of the data per stage. This means we can train powerful LLMs more efficiently, which is crucial as these models continue to grow in size and complexity.

Abstract

Data selection is a critical aspect of Reinforcement Learning with Verifiable Rewards (RLVR) for enhancing the reasoning capabilities of large language models (LLMs). Current data selection methods are largely heuristic-based, lacking theoretical guarantees and generalizability. This work proposes a theoretically-grounded approach using influence functions to estimate the contribution of each data point to the learning objective. To overcome the prohibitive computational cost of policy rollouts required for online influence estimation, we introduce an off-policy influence estimation method that efficiently approximates data influence using pre-collected offline trajectories. Furthermore, to manage the high-dimensional gradients of LLMs, we employ sparse random projection to reduce dimensionality and improve storage and computation efficiency. Leveraging these techniques, we develop Curriculum RL with Off-Policy Influence guidance (CROPI), a multi-stage RL framework that iteratively selects the most influential data for the current policy. Experiments on models up to 7B parameters demonstrate that CROPI significantly accelerates training. On a 1.5B model, it achieves a 2.66x step-level acceleration while using only 10% of the data per stage compared to full-dataset training. Our results highlight the substantial potential of influence-based data selection for efficient RLVR.
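The multi-stage structure described in the abstract can be sketched as a simple loop: at each stage, score a pool of pre-collected (offline) trajectories against the current policy's objective gradient, train on the top fraction, and repeat. Everything below is a hypothetical skeleton with placeholder functions and random data, not the paper's implementation; the influence proxy, projection dimension, and 10% budget are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def influence_scores(pool_grads, policy_grad):
    """Off-policy influence proxy: inner product between each cached
    offline trajectory gradient and the current policy's objective
    gradient. Placeholder for the paper's actual estimator."""
    return pool_grads @ policy_grad

def train_stage(policy, batch_ids):
    """Placeholder RL update: here it only records which data was used."""
    policy["history"].append(sorted(batch_ids.tolist()))
    return policy

n_pool, k, n_stages, frac = 200, 16, 3, 0.10
pool_grads = rng.normal(size=(n_pool, k))  # pre-projected offline gradients
policy = {"history": []}

for stage in range(n_stages):
    # Gradient of the current policy's learning objective (random stand-in).
    policy_grad = rng.normal(size=k)
    scores = influence_scores(pool_grads, policy_grad)
    budget = int(frac * n_pool)
    chosen = np.argsort(scores)[::-1][:budget]  # most influential 10%
    policy = train_stage(policy, chosen)
```

Because the scores are recomputed against the current policy at every stage, the selected subset shifts as the model improves, which is what makes the schedule a curriculum rather than a one-shot filter.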