
Iterative Nash Policy Optimization: Aligning LLMs with General Preferences via No-Regret Learning

Yuheng Zhang, Dian Yu, Baolin Peng, Linfeng Song, Ye Tian, Mingyue Huo, Nan Jiang, Haitao Mi, Dong Yu

2024-07-01


Summary

This paper introduces a new method called Iterative Nash Policy Optimization (INPO) that helps align large language models (LLMs) with human preferences. It focuses on improving how these models learn from human preference feedback by framing alignment as a game the model plays against itself.

What's the problem?

While Reinforcement Learning with Human Feedback (RLHF) has been effective in making LLMs more aligned with what people want, most existing methods rely on reward models built on the Bradley-Terry assumption, which may not fully capture the complexity of human preferences. This can lead to models that don't always provide the best responses or accurately understand user intentions. Additionally, methods that try to handle more general preferences usually need to estimate an expected win rate for every individual response, which is computationally expensive or requires a lot of extra annotation.

What's the solution?

To address these issues, the authors propose INPO, which treats the alignment of LLMs as a two-player game where the model learns by playing against itself. This 'no-regret' learning approach lets the model approximate the Nash policy of the game without calculating expected win rates for every response, which saves computation and annotation effort. Instead, it directly minimizes a new loss objective over a preference dataset. The authors tested INPO on a LLaMA-3-8B-based model and found that it significantly improved performance on benchmarks such as AlpacaEval 2.0 and Arena-Hard compared to previous methods.
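To make the idea of self-play on preference data concrete, here is a minimal PyTorch sketch of a pairwise loss in which the frozen policy from the previous iteration plays the role of the opponent/reference. It uses the familiar DPO-style logistic form rather than the paper's exact INPO objective, and the function and parameter names (selfplay_preference_loss, beta) are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def selfplay_preference_loss(policy_logp_w, policy_logp_l,
                             opponent_logp_w, opponent_logp_l,
                             beta=0.1):
    """DPO-style pairwise loss with the frozen previous-iteration policy
    acting as the opponent/reference (an illustrative sketch, not the
    paper's exact INPO objective).

    policy_logp_w / policy_logp_l: summed log-probabilities of the
        preferred / dispreferred response under the policy being trained.
    opponent_logp_w / opponent_logp_l: the same log-probabilities under
        the frozen policy from the previous iteration.
    beta: strength of the implicit KL regularization toward the opponent.
    """
    # Log-ratio margin: how much more the current policy favors the
    # preferred response than its previous iterate does.
    margin = (policy_logp_w - opponent_logp_w) - (policy_logp_l - opponent_logp_l)
    # Logistic loss on the scaled margin; minimizing it pushes the policy
    # to "beat" its previous self on the preference data.
    return -F.logsigmoid(beta * margin).mean()


# Example usage with dummy log-probabilities for a batch of 4 preference pairs.
if __name__ == "__main__":
    lp_w = torch.randn(4, requires_grad=True)
    lp_l = torch.randn(4, requires_grad=True)
    ref_w, ref_l = torch.randn(4), torch.randn(4)
    loss = selfplay_preference_loss(lp_w, lp_l, ref_w, ref_l)
    loss.backward()
    print(float(loss))
```

In an iterative scheme of this kind, the newly trained policy would replace the frozen opponent after each round, so the model keeps playing against its most recent self.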

Why it matters?

This research is important because it provides a more efficient way to train LLMs to better understand and respond to human preferences. By using a game-theoretic approach, INPO can help create AI systems that are more responsive and aligned with what users actually want, leading to better interactions in applications like chatbots, virtual assistants, and other AI-driven tools.

Abstract

Reinforcement Learning with Human Feedback (RLHF) has achieved great success in aligning large language models (LLMs) with human preferences. Prevalent RLHF approaches are reward-based, following the Bradley-Terry (BT) model assumption, which may not fully capture the complexity of human preferences. In this paper, we explore RLHF under a general preference framework and approach it from a game-theoretic perspective. Specifically, we formulate the problem as a two-player game and propose a novel algorithm, iterative Nash policy optimization (INPO). The key idea is to let the policy play against itself via no-regret learning, thereby approximating the Nash policy. Unlike previous methods, INPO bypasses the need for estimating the expected win rate for individual responses, which typically incurs high computational or annotation costs. Instead, we introduce a new loss objective that is directly minimized over a preference dataset. We provide theoretical analysis for our approach and demonstrate its effectiveness through experiments on various representative benchmarks. With an LLaMA-3-8B-based SFT model, INPO achieves a 41.5% length-controlled win rate on AlpacaEval 2.0 and a 38.3% win rate on Arena-Hard, showing substantial improvement over the state-of-the-art iterative algorithm [Dong et al., 2024] under the BT model assumption. Additionally, our ablation study highlights the benefits of incorporating KL regularization for response length control.
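For reference, the two-player game mentioned in the abstract is usually written in this line of work as a KL-regularized preference game over policies; the formulation below is an illustrative sketch in standard notation, not necessarily the paper's exact statement.

```latex
\pi^{*} \;=\; \arg\max_{\pi}\,\min_{\pi'}\;
\mathbb{E}_{x,\; y \sim \pi(\cdot \mid x),\; y' \sim \pi'(\cdot \mid x)}
\big[\, \mathbb{P}(y \succ y' \mid x) \,\big]
\;-\; \tau\,\mathrm{KL}\!\left(\pi \,\Vert\, \pi_{\mathrm{ref}}\right)
\;+\; \tau\,\mathrm{KL}\!\left(\pi' \,\Vert\, \pi_{\mathrm{ref}}\right)
```

Here the Nash policy \(\pi^{*}\) is the alignment target, no-regret self-play (the policy repeatedly updating against its own previous iterate) approximates it on average, and \(\tau\) is the KL-regularization strength that, per the abstract's ablation, also helps control response length.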