SimKO: Simple Pass@K Policy Optimization
Ruotian Peng, Yi Ren, Zhouliang Yu, Weiyang Liu, Yandong Wen
2025-10-17
Summary
This paper investigates a problem with how large language models are trained to improve their reasoning skills using a technique called Reinforcement Learning with Verifiable Rewards (RLVR). It finds that these models tend to get stuck on a single 'best' answer, even when other lines of reasoning might also lead to correct answers.
What's the problem?
When training these language models, researchers noticed a strange trade-off: the models got very good at producing a correct answer on their first attempt (pass@1), but their performance dropped when they were allowed several attempts and only needed one of them to be correct (pass@K, where K is greater than 1). This suggests the models weren't exploring enough different options and were becoming overconfident in a single line of reasoning. The paper digs into *why* this happens, discovering that during training, the model's probability mass concentrates heavily on the single top-ranked token at each step, suppressing other potentially valid choices.
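The concentration effect described above can be made concrete with two simple diagnostics over a token's probability distribution: the mass on the top-1 candidate and the distribution's entropy. This is an illustrative sketch, not the paper's actual measurement code; the function names and example distributions are my own.

```python
import math

def top1_mass(probs):
    """Fraction of probability mass on the single most likely token."""
    return max(probs)

def entropy(probs):
    """Shannon entropy (in nats) of a token distribution.
    Lower entropy means the distribution is more concentrated."""
    return -sum(p * math.log(p) for p in probs if p > 0)

# A distribution that has over-concentrated on its top candidate...
concentrated = [0.97, 0.01, 0.01, 0.01]
# ...versus one that still spreads mass over several plausible tokens.
spread = [0.40, 0.30, 0.20, 0.10]
```

In the paper's terms, RLVR training tends to push token distributions toward the `concentrated` shape (high top-1 mass, low entropy), and stronger concentration correlates with worse pass@K.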
What's the solution?
To fix this, the researchers created a new training method called Simple Pass@K Optimization (SimKO). SimKO adjusts the probabilities of candidate tokens asymmetrically during training. When a response is verified as correct, it boosts the probabilities of the top-K candidate tokens, not just the single best one. When a response is verified as incorrect, it applies a stronger penalty to the top-1 candidate. This asymmetry is most effective at high-entropy tokens, where the model is genuinely unsure, and it encourages the model to keep a wider range of options open rather than collapsing onto one choice.
Why does it matter?
This work is important because it provides a simple and effective way to improve the reasoning abilities of large language models. By encouraging exploration and preventing the model from collapsing onto a single answer, SimKO helps these models perform better on complex tasks that benefit from considering multiple possibilities, like solving math problems or logic puzzles. It is a practical improvement to a key training technique for these systems.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has advanced the reasoning capabilities of large language models (LLMs). However, prevailing RLVR methods exhibit a systematic bias toward exploitation over exploration, as evidenced by improved pass@1 but reduced pass@K (K>1) performance. To understand this issue, we analyze training dynamics of RLVR methods by tracking the token-level probability distributions over vocabulary candidates. Our analysis reveals a consistent probability concentration effect where the top-1 candidate increasingly accumulates probability mass and suppresses that of other candidates. More importantly, stronger over-concentration correlates with worse pass@K performance. Inspired by this finding, we propose Simple Pass@K Optimization (SimKO), a method designed to mitigate the over-concentration issue, thereby encouraging exploration. SimKO operates in an asymmetrical manner. For verified-correct responses, it boosts the probabilities of the top-K candidates. For verified-incorrect responses, it applies stronger penalties to the top-1 candidate. We observe that this asymmetric design is particularly effective at mitigating over-concentration when applied at tokens with high entropy. Across various math and logical-reasoning benchmarks, SimKO consistently yields higher pass@K for a wide range of K, providing a simple way to improve RLVR's exploration.