Beyond Pass@1: Self-Play with Variational Problem Synthesis Sustains RLVR

Xiao Liang, Zhongzhi Li, Yeyun Gong, Yelong Shen, Ying Nian Wu, Zhijiang Guo, Weizhu Chen

2025-08-25

Summary

This paper focuses on improving how we train large language models (LLMs) to solve complex problems using a technique called Reinforcement Learning with Verifiable Rewards (RLVR). The goal is to make these models not only arrive at a correct answer, but also reason reliably and produce diverse solutions.

What's the problem?

When LLMs are trained with RLVR, a common issue arises: the model gets better at producing *a* correct answer, but it becomes less exploratory and tries fewer distinct ways of solving each problem. So while its first-try accuracy (Pass@1) improves, its chance of finding *any* correct answer within a larger set of attempts (Pass@k) stagnates or even decreases. Essentially, the model becomes too focused and loses its ability to think outside the box.
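To make the Pass@k metric concrete, here is the standard unbiased estimator widely used for it (introduced for code-generation evaluation; this helper is an illustration, not code from the paper). Given n samples per problem, of which c are correct, it computes the probability that at least one of k randomly drawn samples is correct:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased Pass@k estimator: probability that at least one of k
    samples drawn (without replacement) from n total samples, of which
    c are correct, is correct. Equals 1 - C(n-c, k) / C(n, k)."""
    if n - c < k:
        # Fewer than k incorrect samples exist, so any draw of k
        # must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)
```

For example, with 32 samples of which 8 are correct, `pass_at_k(32, 8, 1)` gives 0.25, while `pass_at_k(32, 8, 32)` gives 1.0, illustrating why Pass@k at large k acts as an upper bound on a model's reasoning ability.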

What's the solution?

The researchers discovered that continually adding and changing the practice problems the LLM is given during training helps prevent this 'focusing' problem. They developed a method called Self-play with Variational problem Synthesis (SvS). This method uses the LLM’s successful solutions to automatically create new, similar problems, but keeps the correct answers the same. This allows the model to practice on a wider range of challenges without losing its understanding of what a correct solution looks like, maintaining its ability to explore different approaches.
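The loop described above can be sketched as follows. This is a toy illustration only: `policy_solve`, `policy_synthesize`, and the stub behaviors are hypothetical stand-ins for the LLM calls, and the actual RLVR policy update is omitted. The key structural idea it shows is that synthesized variants keep the original reference answer, so they remain automatically verifiable:

```python
import random

def policy_solve(problem):
    # Stand-in for sampling a solution from the policy: succeeds
    # with some probability by returning the reference answer.
    return problem["answer"] if random.random() < 0.6 else None

def policy_synthesize(problem, solution):
    # Stand-in for the policy rewriting a solved problem into a
    # variant; the reference answer is kept identical, so the new
    # problem needs no extra labeling to be verifiable.
    return {"question": problem["question"] + " (variant)",
            "answer": problem["answer"]}

def svs_step(train_pool, k=8):
    """One self-play step: solve, verify, synthesize variants."""
    new_problems = []
    for problem in list(train_pool):
        samples = [policy_solve(problem) for _ in range(k)]
        # Verifiable reward: compare sampled answers to the reference.
        correct = [s for s in samples if s == problem["answer"]]
        # (An RLVR policy update on these rewards would happen here.)
        if correct:
            new_problems.append(policy_synthesize(problem, correct[0]))
    train_pool.extend(new_problems)  # grow and refresh the problem set
    return train_pool
```

Because the training pool keeps gaining fresh variants of problems the model can already solve, the policy is continually pushed to handle new phrasings rather than collapsing onto a single memorized solution path.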

Why it matters?

This research matters because it shows a way to sustain and significantly improve the reasoning abilities of LLMs over prolonged training. By preventing the models from becoming too rigid in their thinking, SvS achieves much higher accuracy on challenging reasoning tasks, including absolute Pass@32 gains of 18.3% and 22.8% on the competition-level AIME24 and AIME25 benchmarks. This makes LLMs more reliable and versatile problem-solvers, which is crucial for real-world applications.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has recently emerged as a key paradigm for post-training Large Language Models (LLMs), particularly for complex reasoning tasks. However, vanilla RLVR training has been shown to improve Pass@1 performance at the expense of policy entropy, leading to reduced generation diversity and limiting the Pass@k performance, which typically represents the upper bound of LLM reasoning capability. In this paper, we systematically analyze the policy's generation diversity from the perspective of training problems and find that augmenting and updating training problems helps mitigate entropy collapse during training. Based on these observations, we propose an online Self-play with Variational problem Synthesis (SvS) strategy for RLVR training, which uses the policy's correct solutions to synthesize variational problems while ensuring their reference answers remain identical to the originals. This self-improving strategy effectively maintains policy entropy during training and substantially improves Pass@k compared with standard RLVR, sustaining prolonged improvements and achieving absolute gains of 18.3% and 22.8% in Pass@32 performance on the competition-level AIME24 and AIME25 benchmarks. Experiments on 12 reasoning benchmarks across varying model sizes from 3B to 32B consistently demonstrate the generalizability and robustness of SvS.