
Reasoning or Memorization? Unreliable Results of Reinforcement Learning Due to Data Contamination

Mingqi Wu, Zhihao Zhang, Qiaole Dong, Zhiheng Xi, Jun Zhao, Senjie Jin, Xiaoran Fan, Yuhao Zhou, Yanwei Fu, Qin Liu, Songyang Zhang, Qi Zhang

2025-07-15


Summary

This paper shows that using reinforcement learning to improve reasoning in large language models can yield misleading results, because models sometimes memorize test data they have already seen during training, a problem known as data contamination.

What's the problem?

When test data accidentally leaks into the training set, a model can appear to reason well simply by recalling memorized answers. This makes benchmark results unreliable and overstates the model's true ability.

What's the solution?

The researchers analyzed how data contamination inflates reinforcement learning results and emphasized the need for cleaner benchmarks and careful evaluation protocols that distinguish genuine reasoning from simple memorization.
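The idea of checking whether benchmark items leaked into training data can be sketched with a simple n-gram overlap heuristic. This is an illustrative assumption, not the paper's own methodology; the function names here are hypothetical:

```python
# Illustrative sketch of an n-gram contamination check: a common heuristic
# for flagging overlap between training and evaluation data.
# (Assumed example; the paper's actual analysis may use different methods.)

def ngrams(text: str, n: int = 8) -> set:
    """Return the set of word-level n-grams in a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def contamination_rate(train_docs, test_docs, n: int = 8) -> float:
    """Fraction of test documents sharing at least one n-gram with training data."""
    train_grams = set()
    for doc in train_docs:
        train_grams |= ngrams(doc, n)
    flagged = sum(1 for doc in test_docs if ngrams(doc, n) & train_grams)
    return flagged / len(test_docs) if test_docs else 0.0

train = ["the quick brown fox jumps over the lazy dog near the river bank today"]
test_clean = ["a completely different sentence about large language model evaluation benchmarks"]
test_leak = ["prefix words then the quick brown fox jumps over the lazy dog appears"]

print(contamination_rate(train, test_clean + test_leak, n=8))  # 0.5: one of two flagged
```

A high contamination rate suggests that benchmark gains may reflect memorization rather than reasoning, which is why the authors argue for cleaner benchmarks.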

Why it matters?

Distinguishing genuine reasoning from memorization helps researchers build AI that truly learns and generalizes, ensuring that reported progress in AI development is trustworthy and meaningful.

Abstract

Research on enhancing reasoning capabilities in large language models using reinforcement learning reveals that accurate reward signals are crucial for performance improvement, and current benchmarks may be unreliable due to data contamination.