Test-Time Preference Optimization: On-the-Fly Alignment via Iterative Textual Feedback
Yafu Li, Xuyang Hu, Xiaoye Qu, Linjie Li, Yu Cheng
2025-01-23

Summary
This paper introduces a new method called Test-time Preference Optimization (TPO) that helps large language models (LLMs) better understand and follow human preferences without needing to retrain the entire model. It's like teaching an AI to quickly adapt to what people want in real time, using feedback in the form of written critiques.
What's the problem?
Large language models are really good at many tasks, but they have a hard time quickly adjusting to what different people want. Usually, to make an AI model better at following specific preferences, you'd need to retrain it, which takes a lot of time and computing power. It's like having a smart student who knows a lot but struggles to quickly change their approach based on a teacher's feedback.
What's the solution?
The researchers created TPO, which works by turning feedback into written critiques that the AI can understand and use to improve its responses right away. Instead of relying only on number scores to rate the AI's performance, TPO uses actual words to explain what needs to be better. The AI then uses these explanations to refine its answers step by step (see the sketch below). They tested TPO on various tasks and found that a model that had never been through preference-alignment training could, after just a few rounds of feedback, outperform a version that had been aligned through training.
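To make the loop concrete, here is a rough Python sketch of a test-time critique-and-refine procedure in the spirit of TPO. It is not the authors' implementation (their code is at the linked repository); the `policy_llm.generate` and `reward_model.score` interfaces, and the prompt wording, are assumed placeholders for illustration only.

```python
def tpo_sketch(prompt, policy_llm, reward_model, width=4, depth=3):
    """Refine responses at test time using textual critiques, with no parameter updates.

    Assumed (hypothetical) interfaces:
      - policy_llm.generate(text) -> str   # the language model being aligned
      - reward_model.score(prompt, response) -> float  # a numerical preference score
    """
    # Sample an initial pool of candidate responses (search width).
    candidates = [policy_llm.generate(prompt) for _ in range(width)]

    for _ in range(depth):  # number of refinement rounds (search depth)
        # 1. Score every candidate with the numerical reward model.
        scored = sorted(candidates, key=lambda r: reward_model.score(prompt, r))
        worst, best = scored[0], scored[-1]

        # 2. Translate the numerical signal into a textual critique by asking the
        #    model to explain why the preferred response beats the rejected one.
        critique = policy_llm.generate(
            f"Prompt: {prompt}\n"
            f"Preferred response: {best}\n"
            f"Rejected response: {worst}\n"
            "Explain what the preferred response does better and how to improve it further."
        )

        # 3. Use the critique as a "textual reward": sample a fresh batch of
        #    responses conditioned on the written feedback.
        candidates = [
            policy_llm.generate(
                f"{prompt}\n\nRevise your answer with this feedback in mind:\n{critique}"
            )
            for _ in range(width)
        ]

    # Return the highest-scoring response after the final round.
    return max(candidates, key=lambda r: reward_model.score(prompt, r))
```

The `width` and `depth` parameters correspond to the search width and depth mentioned in the abstract: more candidates per round and more rounds of critique generally mean better alignment, at the cost of extra inference-time compute.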
Why it matters?
This matters because it could make AI systems much more flexible and responsive to what people need. Instead of having to create new versions of AI models for different tasks or preferences, we could have one model that quickly adapts to what each person wants. This could lead to more personalized AI assistants, better chatbots, and AI systems that are safer and more aligned with human values. It's a big step towards making AI that can understand and respond to human needs more effectively, without the need for constant retraining.
Abstract
Large language models (LLMs) demonstrate impressive performance but lack the flexibility to adapt to human preferences quickly without retraining. In this work, we introduce Test-time Preference Optimization (TPO), a framework that aligns LLM outputs with human preferences during inference, removing the need to update model parameters. Rather than relying on purely numerical rewards, TPO translates reward signals into textual critiques and uses them as textual rewards to iteratively refine its response. Evaluations on benchmarks covering instruction following, preference alignment, safety, and mathematics reveal that TPO progressively improves alignment with human preferences. Notably, after only a few TPO steps, the initially unaligned Llama-3.1-70B-SFT model can surpass the aligned counterpart, Llama-3.1-70B-Instruct. Furthermore, TPO scales efficiently with both the search width and depth during inference. Through case studies, we illustrate how TPO exploits the innate capacity of LLMs to interpret and act upon reward signals. Our findings establish TPO as a practical, lightweight alternative for test-time preference optimization, achieving alignment on the fly. Our code is publicly available at https://github.com/yafuly/TPO.