A Contextual Quality Reward Model for Reliable and Efficient Best-of-N Sampling
Hyung Gyu Rho
2025-10-08
Summary
This paper focuses on improving how we align large language models with what humans actually want, specifically addressing a weakness in current methods where the model might pick the *least bad* answer instead of a genuinely *good* one.
What's the problem?
Current methods for teaching AI what we prefer, like comparing two responses and choosing the better one, only learn relative preferences. They don't teach the AI to recognize when *all* of the options are unacceptable. This is a big issue for difficult questions: the AI may simply select the least flawed response even when it is still a bad answer overall, leading to unreliable results.
What's the solution?
The researchers introduced a new way to collect data and train the AI. They added an 'outside option' – essentially letting the AI say 'none of these are good enough.' This helps the AI learn to distinguish between what's better and what's simply acceptable. They also created a smart search strategy called 'best of mini-N in-loop' that checks responses in stages and stops searching once it finds something good, saving time and resources.
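The staged search idea can be sketched as a loop that spends its sampling budget in small batches and stops as soon as a candidate clears an acceptability threshold. This is an illustrative sketch only: `generate`, `reward`, and the specific budget, batch size, and threshold values are hypothetical stand-ins, not the paper's actual implementation.

```python
import random

def generate(prompt):
    # Hypothetical stand-in for sampling one response from an LLM.
    return f"response-{random.random():.3f}"

def reward(prompt, response):
    # Hypothetical stand-in for the reward model's quality score.
    return random.random()

def best_of_mini_n_in_loop(prompt, total_budget=16, mini_n=4, threshold=0.8):
    """Spend the generation budget in loops of mini_n samples,
    exiting early once a candidate clears the calibrated threshold."""
    best_resp, best_score = None, float("-inf")
    used = 0
    while used < total_budget:
        # Sample a small batch instead of all N candidates at once.
        batch = [generate(prompt) for _ in range(mini_n)]
        used += mini_n
        for resp in batch:
            score = reward(prompt, resp)
            if score > best_score:
                best_resp, best_score = resp, score
        # Calibrated early-exit: stop searching once something is
        # deemed "good enough", saving the remaining budget.
        if best_score >= threshold:
            break
    return best_resp, best_score, used
```

Tightening the threshold trades compute for reliability: a high threshold rarely exits early (more samples, fewer false acceptances), while a low one exits quickly (faster, but riskier).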
Why it matters?
This work is important because it makes AI systems more reliable. By teaching them to reject bad options, we can avoid unsatisfactory or even harmful responses. It also makes these systems faster and more efficient, since they don't waste time evaluating options that are clearly not good enough, offering a principled way to balance quality and speed.
Abstract
Modern preference alignment techniques, such as Best-of-N (BoN) sampling, rely on reward models trained with pairwise comparison data. While effective at learning relative preferences, this paradigm fails to capture a signal of response acceptability, leaving systems vulnerable to selecting the least bad of many unacceptable options. This is particularly problematic for hard prompts, where the risk of such false acceptances increases with the number of samples. In this paper, we address this critical reliability gap by introducing a new data collection and modeling framework. By augmenting preference data with an outside option, inspired by discrete choice models, we train a reward model that can distinguish not just what is better, but what is good enough. We leverage this capability to create an adaptive inference strategy, best of mini-N in-loop, which partitions the generation budget into sequential loops with a calibrated, early-exit condition. Our experiments show that when tuned as an alignment guardrail, it reduces reliability failures by 70%, and when tuned as an inference accelerator, it improves average inference speed by over 22% in the IMDB sentiment setting. We thus provide a principled and flexible framework for practitioners to explicitly manage the trade-off between reliability and computational efficiency.
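To make the "outside option" idea concrete: in a standard discrete choice (multinomial logit) model, adding an outside alternative with normalized utility gives the model a way to place probability mass on "none of the above." The sketch below is a generic logit formulation under that assumption, not the paper's exact parameterization.

```python
import math

def choice_probs(rewards, outside_utility=0.0):
    """Multinomial-logit choice probabilities over candidate rewards,
    augmented with an outside option whose utility is normalized
    (here to 0). The mass on the outside option is the model's
    estimate that no candidate is acceptable."""
    exps = [math.exp(r) for r in rewards] + [math.exp(outside_utility)]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Return (per-candidate probabilities, outside-option probability).
    return probs[:-1], probs[-1]
```

Because the outside option's utility is fixed, candidate rewards become interpretable on an absolute scale: a response whose reward falls below the outside utility is more likely "unacceptable" than chosen, which is exactly the acceptability signal pairwise comparisons alone cannot provide.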