
The Best of N Worlds: Aligning Reinforcement Learning with Best-of-N Sampling via max@k Optimisation

Farid Bagirov, Mikhail Arkhipov, Ksenia Sycheva, Evgeniy Glukhov, Egor Bogomolov

2025-10-28


Summary

This paper investigates how to improve the way Large Language Models (LLMs) solve complex problems in areas like math and coding when they are fine-tuned with a technique called Reinforcement Learning with Verifiable Rewards (RLVR).

What's the problem?

Reinforcement Learning fine-tuning makes LLMs better at solving a problem in a single attempt, but it can also make their outputs less diverse. That hurts the 'Best-of-N' strategy, where you generate many candidate solutions and keep the best one: as N grows, the extra samples add little because the model keeps producing near-identical answers, and Best-of-N performance can even degrade for large N. The model gets stuck in a rut and doesn't explore enough different approaches.

What's the solution?

The researchers focused on directly optimizing a metric called 'max@k', a continuous generalization of pass@k that measures the expected best reward among k generated samples (for correct/incorrect rewards, this is the probability of getting at least one correct answer). They derived an unbiased gradient estimate for this metric to guide the learning process. Importantly, they extended the formula beyond on-policy training, where the model learns only from samples it just generated, to off-policy updates, where it also learns from previously generated samples, which makes the learning process much more sample-efficient.
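The max@k metric itself can be estimated without bias from n >= k samples using the same combinatorial trick as the well-known pass@k estimator: the i-th smallest reward is the maximum of a uniformly random k-subset with probability C(i-1, k-1) / C(n, k). A minimal sketch (the function name is ours, not the paper's):

```python
from math import comb

def max_at_k(rewards, k):
    """Unbiased estimate of max@k from n >= k sampled rewards.

    Averages max(r) over all k-subsets of the n samples in closed form:
    the reward at ascending rank i (0-indexed) is the subset maximum
    with probability comb(i, k-1) / comb(n, k).
    """
    n = len(rewards)
    assert 1 <= k <= n
    r = sorted(rewards)  # ascending
    return sum(comb(i, k - 1) * r[i] for i in range(k - 1, n)) / comb(n, k)
```

For k = 1 this reduces to the mean reward, and for k = n it returns the maximum, matching the two extremes of Best-of-N sampling.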

Why it matters?

This work is important because it helps LLMs become both more accurate *and* more diverse in their problem-solving abilities. By improving the 'Best-of-N' strategy, the model can generate a wider range of potential solutions, increasing the chances of finding the best possible answer, especially for challenging tasks.

Abstract

The application of Reinforcement Learning with Verifiable Rewards (RLVR) to mathematical and coding domains has demonstrated significant improvements in the reasoning and problem-solving abilities of Large Language Models. Despite its success in single generation problem solving, the reinforcement learning fine-tuning process may harm the model's exploration ability, as reflected in decreased diversity of generations and a resulting degradation of performance during Best-of-N sampling for large N values. In this work, we focus on optimizing the max@k metric, a continuous generalization of pass@k. We derive an unbiased on-policy gradient estimate for direct optimization of this metric. Furthermore, we extend our derivations to the off-policy updates, a common element in modern RLVR algorithms, that allows better sample efficiency. Empirically, we show that our objective effectively optimizes max@k metric in off-policy scenarios, aligning the model with the Best-of-N inference strategy.
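As a rough illustration of how an objective aligned with Best-of-N might be wired up (a sketch under our own assumptions, not the authors' actual derivation, and all names are hypothetical): each of n sampled completions can be weighted by the probability that it would be the maximum of a random k-subset, yielding a REINFORCE-style surrogate whose gradient raises the log-probability of samples that win Best-of-k.

```python
import numpy as np
from math import comb

def best_of_k_weights(rewards, k):
    """Probability that each sample is the maximum of a uniformly
    random k-subset of the n sampled rewards (ties broken by sort order)."""
    n = len(rewards)
    w = np.zeros(n)
    for rank, idx in enumerate(np.argsort(rewards, kind="stable")):
        if rank >= k - 1:  # ranks below k-1 can never be a k-subset maximum
            w[idx] = comb(rank, k - 1) / comb(n, k)
    return w

def max_at_k_surrogate_loss(logprobs, rewards, k):
    """REINFORCE-style surrogate: minimizing -sum_i w_i * r_i * log p_i
    pushes probability mass toward samples that would win Best-of-k.
    The weights are treated as constants (no gradient flows through them)."""
    w = best_of_k_weights(rewards, k)
    return -float(np.dot(w * np.asarray(rewards, dtype=float), logprobs))
```

In a real RLVR loop the log-probabilities would come from the policy model (with importance-sampling corrections for off-policy data, as the abstract describes); here they are plain numbers for illustration.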