
Sample-Efficient Alignment for LLMs

Zichen Liu, Changyu Chen, Chao Du, Wee Sun Lee, Min Lin

2024-11-06


Summary

This paper introduces Sample-Efficient Alignment (SEA), a method that helps large language models (LLMs) match human preferences while using much less human feedback.

What's the problem?

Aligning LLMs with what humans want is important, but it usually requires a lot of human feedback, which can be time-consuming and costly. This makes it hard to improve these models effectively and efficiently.

What's the solution?

The researchers formulated the alignment problem using a framework from bandit theory called contextual dueling bandits, in which an agent repeatedly shows two candidate responses and learns from which one the annotator prefers. They developed SEA, an agent that uses Thompson sampling to decide which response pairs are most informative to ask about, so the model learns from far fewer preference labels. SEA was tested on language models of three sizes and aligned them with the preference oracle much more effectively than previous active exploration methods (a simplified sketch of the core idea follows below).
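
To make the exploration idea concrete, below is a minimal, hypothetical Thompson sampling sketch for a dueling bandit over a fixed set of candidate answers. The actual SEA agent works over LLM responses with a learned reward model, so every class name and number here is an illustrative assumption, not the paper's implementation.

    import random

    # Illustrative only: Beta posteriors over each arm's win rate stand in for
    # the uncertainty SEA maintains over LLM responses.
    class DuelingThompsonSampler:
        def __init__(self, num_arms):
            # Beta(1, 1) prior on every arm's probability of winning a duel.
            self.wins = [1.0] * num_arms
            self.losses = [1.0] * num_arms

        def select_duel(self):
            # Sample a plausible win rate for each arm from its posterior,
            # then duel the two arms whose samples are highest.
            samples = [random.betavariate(w, l)
                       for w, l in zip(self.wins, self.losses)]
            ranked = sorted(range(len(samples)), key=samples.__getitem__, reverse=True)
            return ranked[0], ranked[1]

        def update(self, winner, loser):
            # Preference feedback only says which arm won this duel.
            self.wins[winner] += 1
            self.losses[loser] += 1

    # Toy usage: arms have hidden quality and a noisy oracle prefers the better one.
    hidden_quality = [0.2, 0.5, 0.8, 0.35]
    sampler = DuelingThompsonSampler(num_arms=len(hidden_quality))
    for _ in range(200):
        a, b = sampler.select_duel()
        p_a_wins = hidden_quality[a] / (hidden_quality[a] + hidden_quality[b])
        winner, loser = (a, b) if random.random() < p_a_wins else (b, a)
        sampler.update(winner, loser)
    print("posterior win counts:", sampler.wins)

Because sampling from the posterior naturally favors arms that might be best but are still uncertain, the comparison budget is spent where it is most informative, which is the sample-efficiency argument in plain terms.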

Why it matters?

This work is significant because it allows for faster and cheaper improvements to LLMs, making them more useful and aligned with what people actually want. By reducing the need for extensive human feedback, it opens up new possibilities for deploying LLMs in various applications, from chatbots to content creation.

Abstract

We study methods for efficiently aligning large language models (LLMs) with human preferences given budgeted online feedback. We first formulate the LLM alignment problem in the frame of contextual dueling bandits. This formulation, subsuming recent paradigms such as online RLHF and online DPO, inherently quests for sample-efficient algorithms that incorporate online active exploration. Leveraging insights from bandit theory, we introduce a unified algorithm based on Thompson sampling and highlight its applications in two distinct LLM alignment scenarios. The practical agent that efficiently implements this algorithm, named SEA (Sample-Efficient Alignment), is empirically validated through extensive experiments across three model scales (1B, 2.8B, 6.9B) and three preference learning algorithms (DPO, IPO, SLiC). The results demonstrate that SEA achieves highly sample-efficient alignment with oracle's preferences, outperforming recent active exploration methods for LLMs. Additionally, we release the implementation of SEA together with an efficient codebase designed for online alignment of LLMs, aiming to accelerate future research in this field.
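
For the preference-learning side, here is a hedged sketch of the standard DPO objective, one of the three learners the paper combines SEA with (IPO and SLiC are the others). The log-probabilities are made-up placeholders; in practice they would come from the policy being trained and a frozen reference model evaluated on (chosen, rejected) response pairs.

    import torch
    import torch.nn.functional as F

    def dpo_loss(policy_chosen_logps, policy_rejected_logps,
                 ref_chosen_logps, ref_rejected_logps, beta=0.1):
        # Implicit rewards: how much more likely the policy makes each response
        # compared with the frozen reference model, scaled by beta.
        chosen_reward = beta * (policy_chosen_logps - ref_chosen_logps)
        rejected_reward = beta * (policy_rejected_logps - ref_rejected_logps)
        # Push the chosen response's implicit reward above the rejected one's.
        return -F.logsigmoid(chosen_reward - rejected_reward).mean()

    # Toy usage with invented log-probabilities for a batch of two pairs.
    loss = dpo_loss(torch.tensor([-12.0, -9.5]), torch.tensor([-14.0, -10.0]),
                    torch.tensor([-12.5, -9.8]), torch.tensor([-13.5, -10.1]))
    print(loss.item())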