Detecting Data Contamination from Reinforcement Learning Post-training for Large Language Models
Yongding Tao, Tian Wang, Yihong Dong, Huanyu Liu, Kechi Zhang, Xiaolong Hu, Ge Li
2025-10-15
Summary
This paper investigates a hidden problem with how we test powerful AI language models, specifically those improved using a technique called Reinforcement Learning. It introduces a new method, Self-Critique, to detect if the test data was accidentally used during the model's training, which would make the test results unreliable.
What's the problem?
When evaluating how good AI language models are, we use benchmarks – sets of questions or tasks. A big issue is 'data contamination,' where some of the benchmark questions may have been part of the data the model was trained on. This makes the model seem better than it is because it's essentially 'cheating' by remembering the answers. Existing methods can detect this contamination in earlier training stages (pre-training and supervised fine-tuning), but they don't work well after the model has been refined using Reinforcement Learning, a crucial step for improving reasoning abilities. This leaves a significant gap in ensuring fair and accurate evaluations.
What's the solution?
The researchers noticed that after Reinforcement Learning, a model's responses to questions it was trained on become very predictable, concentrating on a narrow set of outputs. They developed Self-Critique, which probes for this narrowing – a 'policy collapse' – by checking whether the model has converged on a single, potentially memorized way of answering a question. To support this line of research, they also built RL-MIA, a benchmark designed specifically to simulate contamination during the Reinforcement Learning phase. In their experiments, Self-Critique detected contamination far more reliably than previous methods, improving detection AUC by up to 30%.
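The core intuition – contaminated items yield collapsed, low-diversity outputs – can be illustrated with a simple entropy measure over repeated samples. This is only a minimal sketch of the idea, not the paper's actual Self-Critique score; the sampled answers below are made up for illustration.

```python
from collections import Counter
import math

def response_entropy(samples):
    """Shannon entropy (bits) of the empirical distribution over sampled answers.

    A score near 0 means the sampled answers have collapsed onto one output,
    the kind of narrowing associated here with RL-phase contamination.
    Higher entropy means the answers are still diverse.
    """
    counts = Counter(samples)
    n = len(samples)
    # p * log2(1/p) summed over distinct answers; written to stay non-negative.
    return sum((c / n) * math.log2(n / c) for c in counts.values())

# Hypothetical contaminated item: 8 samples all converge on one memorized answer.
collapsed = ["42"] * 8
# Hypothetical clean item: samples spread over several distinct outcomes.
diverse = ["42", "41", "forty-two", "42.0", "43", "42", "41", "44"]

print(response_entropy(collapsed))  # 0.0 (fully collapsed)
print(response_entropy(diverse))    # > 1 bit (still diverse)
```

In this toy version, flagging items whose entropy falls below a threshold would act as a crude contamination detector; the paper's method probes the underlying policy collapse more directly.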
Why it matters?
This research is important because it addresses a critical vulnerability in evaluating advanced AI models. If we can't trust the benchmarks, we can't accurately measure progress or compare different models. Self-Critique provides a way to ensure that reported performance is genuine and reflects the model's true capabilities, especially as Reinforcement Learning becomes more important for building smarter AI systems.
Abstract
Data contamination poses a significant threat to the reliable evaluation of Large Language Models (LLMs). This issue arises when benchmark samples inadvertently appear in training sets, compromising the validity of reported performance. While detection methods have been developed for the pre-training and Supervised Fine-Tuning stages, a critical research gap exists for the increasingly significant phase of Reinforcement Learning (RL) post-training. As RL post-training becomes pivotal for advancing LLM reasoning, the absence of specialized contamination detection methods in this paradigm presents a critical vulnerability. To address this, we conduct the first systematic study of contamination detection in the RL post-training scenario and propose Self-Critique. Our method is motivated by a key observation: after the RL phase, the output entropy distribution of LLMs tends to collapse into highly specific and sparse modes. Self-Critique probes for the underlying policy collapse, i.e., the model's convergence to a narrow reasoning path, which causes this entropy reduction. To facilitate this research, we also introduce RL-MIA, a benchmark constructed to simulate this specific contamination scenario. Extensive experiments show that Self-Critique significantly outperforms baseline methods across multiple models and contamination tasks, achieving an AUC improvement of up to 30%. Whereas existing methods are close to a random guess for RL-phase contamination, our method makes detection possible.
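The abstract evaluates detectors by AUC on a membership-inference split: contaminated (member) items should receive higher detector scores than clean (non-member) items, and an AUC near 0.5 means the detector is guessing. A self-contained sketch of that metric, with hypothetical scores (not results from the paper):

```python
def auc(member_scores, nonmember_scores):
    """ROC AUC via its rank interpretation: the probability that a randomly
    chosen contaminated item scores higher than a randomly chosen clean one,
    counting ties as half. Higher scores mean 'more likely contaminated'.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))

# Hypothetical detector outputs on a small member/non-member split.
baseline = auc([0.51, 0.49, 0.50], [0.50, 0.52, 0.48])  # 0.5: chance level
detector = auc([0.90, 0.80, 0.85], [0.20, 0.30, 0.10])  # 1.0: full separation
print(baseline, detector)
```

This makes the abstract's claim concrete: baselines hovering near AUC 0.5 are indistinguishable from random guessing, while a 30-point improvement moves the detector toward reliable separation of contaminated and clean items.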