
Limits of Generalization in RLVR: Two Case Studies in Mathematical Reasoning

Md Tanvirul Alam, Nidhi Rastogi

2025-11-03


Summary

This paper investigates how well a technique called Reinforcement Learning with Verifiable Rewards, or RLVR, actually teaches large language models to *reason* mathematically, rather than just find quick tricks to get the right answer.

What's the problem?

Large language models are getting better at math problems, but it's hard to know whether they truly understand the underlying logic or are just memorizing patterns. RLVR is supposed to help by rewarding models for showing their work and arriving at the correct answer, but this paper questions whether it actually builds genuine reasoning skills or just reinforces shortcuts.

What's the solution?

The researchers tested RLVR on two combinatorial problems that each have a clear, verifiable correct solution: Activity Scheduling and finding the Longest Increasing Subsequence in a sequence of numbers. They tried different reward designs and analyzed whether the improvements in performance came from the model learning better reasoning strategies or from picking up on superficial cues that point to the right answer.
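What makes these problems suitable for RLVR is that the optimum can be computed exactly, so a model's answer can be checked programmatically. As a minimal illustration (the function names and the 0/1 reward scheme here are illustrative, not taken from the paper's code), a verifiable reward for the Longest Increasing Subsequence task might check that the model's candidate is a valid increasing subsequence and matches the optimal length found by dynamic programming:

```python
def lis_length(nums):
    # Classic O(n^2) DP: best[i] = length of the longest
    # increasing subsequence ending at index i.
    best = [1] * len(nums)
    for i in range(len(nums)):
        for j in range(i):
            if nums[j] < nums[i]:
                best[i] = max(best[i], best[j] + 1)
    return max(best, default=0)

def verify_reward(nums, candidate):
    # Reward 1.0 only if the candidate is a strictly increasing
    # subsequence of `nums` AND achieves the optimal length;
    # otherwise 0.0. This is the "verifiable" part of RLVR:
    # no human judgment is needed to score the answer.
    def is_subsequence(sub, seq):
        it = iter(seq)
        return all(x in it for x in sub)

    increasing = all(a < b for a, b in zip(candidate, candidate[1:]))
    if (increasing
            and is_subsequence(candidate, nums)
            and len(candidate) == lis_length(nums)):
        return 1.0
    return 0.0
```

For example, `verify_reward([10, 9, 2, 5, 3, 7, 101, 18], [2, 3, 7, 18])` returns 1.0, while a shorter valid subsequence like `[2, 5, 7]` scores 0.0. The paper's point is that even with an exact checker like this, the model may learn surface patterns in the training data rather than the reasoning the checker is meant to encourage.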

Why it matters?

The findings suggest that RLVR, as currently practiced, often improves scores by exploiting easy patterns rather than by developing real mathematical reasoning ability. This highlights the need for better ways to test and measure a model's true understanding of math, and for benchmarks that can distinguish genuine reasoning from clever shortcuts.

Abstract

Mathematical reasoning is a central challenge for large language models (LLMs), requiring not only correct answers but also faithful reasoning processes. Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a promising approach for enhancing such capabilities; however, its ability to foster genuine reasoning remains unclear. We investigate RLVR on two combinatorial problems with fully verifiable solutions: Activity Scheduling and the Longest Increasing Subsequence, using carefully curated datasets with unique optima. Across multiple reward designs, we find that RLVR improves evaluation metrics but often by reinforcing superficial heuristics rather than acquiring new reasoning strategies. These findings highlight the limits of RLVR generalization, emphasizing the importance of benchmarks that disentangle genuine mathematical reasoning from shortcut exploitation and provide faithful measures of progress. Code available at https://github.com/xashru/rlvr-seq-generalization.