Sample, Don't Search: Rethinking Test-Time Alignment for Language Models

Gonçalo Faria, Noah A. Smith

2025-04-08

Summary

This paper introduces QAlign, a method for improving AI chatbot answers by spending extra computation while generating a response, without changing the model's underlying weights, much like helping a student check their work multiple times to arrive at a better answer.

What's the problem?

Current methods for improving AI answers either require costly retraining (which isn't always possible, especially when model weights are private) or actually get worse as more compute is added, because the imperfect reward systems guiding the search end up over-optimized.

What's the solution?

QAlign uses a statistical sampling technique (Markov chain Monte Carlo, which is like making many educated guesses and keeping the better ones) that keeps improving answers the more computer power you give it, without needing to peek inside the AI's brain or retrain it.

Why does it matter?

This helps create safer and more accurate AI tools for things like homework help or medical advice, especially when companies can't or won't share how their AI works internally.

Abstract

Increasing test-time computation has emerged as a promising direction for improving language model performance, particularly in scenarios where model finetuning is impractical or impossible due to computational constraints or private model weights. However, existing test-time search methods using a reward model (RM) often degrade in quality as compute scales, due to the over-optimization of what are inherently imperfect reward proxies. We introduce QAlign, a new test-time alignment approach. As we scale test-time compute, QAlign converges to sampling from the optimal aligned distribution for each individual prompt. By adopting recent advances in Markov chain Monte Carlo for text generation, our method enables better-aligned outputs without modifying the underlying model or even requiring logit access. We demonstrate the effectiveness of QAlign on mathematical reasoning benchmarks (GSM8K and GSM-Symbolic) using a task-specific RM, showing consistent improvements over existing test-time compute methods like best-of-n and majority voting. Furthermore, when applied with more realistic RMs trained on the Tulu 3 preference dataset, QAlign outperforms direct preference optimization (DPO), best-of-n, majority voting, and weighted majority voting on a diverse range of datasets (GSM8K, MATH500, IFEval, MMLU-Redux, and TruthfulQA). A practical solution to aligning language models at test time using additional computation without degradation, our approach expands the limits of the capability that can be obtained from off-the-shelf language models without further training.
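The core idea in the abstract, sampling from a reward-tilted version of the base model's distribution using Markov chain Monte Carlo, can be illustrated with a toy sketch. This is not QAlign's actual algorithm (the paper adopts more sophisticated MCMC proposals for text); it is a simplified independence Metropolis-Hastings chain, where `sample_from_base` and `reward` are hypothetical stand-ins for a base language model and a reward model. When the proposal is the base model itself, the acceptance ratio reduces to exp((r(y') - r(y)) / beta), and the chain's stationary distribution is proportional to p0(y) * exp(r(y) / beta), i.e. the base distribution tilted toward high reward.

```python
import math
import random

# Toy stand-ins (assumptions for illustration, not QAlign's components):
# the "base model" proposes an integer answer uniformly at random, and
# the "reward model" prefers answers close to a fixed target.
def sample_from_base(rng):
    return rng.randint(0, 100)

def reward(y, target=42):
    return -abs(y - target)

def aligned_mcmc_chain(n_steps, beta=5.0, seed=0):
    """Independence Metropolis-Hastings with the base model as proposal.

    Target distribution: pi(y) proportional to p0(y) * exp(reward(y) / beta).
    Because the proposal q(y) equals p0(y), the usual MH ratio
    pi(y') q(y) / (pi(y) q(y')) simplifies to exp((r(y') - r(y)) / beta).
    """
    rng = random.Random(seed)
    y = sample_from_base(rng)
    chain = [y]
    for _ in range(n_steps):
        y_new = sample_from_base(rng)
        accept_prob = min(1.0, math.exp((reward(y_new) - reward(y)) / beta))
        if rng.random() < accept_prob:
            y = y_new  # move to the proposed answer
        chain.append(y)  # otherwise stay put (the state repeats)
    return chain

chain = aligned_mcmc_chain(500)
```

As more steps are added, the chain spends more of its time on high-reward answers than plain sampling from the base model would, which mirrors the abstract's claim that alignment quality improves as test-time compute scales rather than degrading.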