
BOND: Aligning LLMs with Best-of-N Distillation

Pier Giuseppe Sessa, Robert Dadashi, Léonard Hussenot, Johan Ferret, Nino Vieillard, Alexandre Ramé, Bobak Shahriari, Sarah Perrin, Abe Friesen, Geoffrey Cideron, Sertan Girgin, Piotr Stanczyk, Andrea Michi, Danila Sinopalnikov, Sabela Ramos, Amélie Héliou, Aliaksei Severyn, Matt Hoffman, Nikola Momchev, Olivier Bachem

2024-07-23


Summary

This paper presents BOND, a new alignment algorithm that distills Best-of-N sampling into a large language model (LLM) itself. The goal is to make a single sampled response match the quality of picking the best out of N candidates, without the extra computation that Best-of-N requires at inference time.

What's the problem?

Reinforcement learning from human feedback (RLHF) has improved the quality and safety of LLMs, but a simple and often stronger inference-time strategy, Best-of-N sampling, generates N candidate responses and uses a reward model to pick the best one. Because every answer then costs N generations plus N reward evaluations, Best-of-N is expensive and slow, which makes it hard to use in real-time applications where quick responses are needed.
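To make that inference-time cost concrete, here is a minimal Python sketch of Best-of-N sampling. The helper names generate_response and reward_model are hypothetical stand-ins for an LLM sampler and a learned reward model; they are not functions from the paper.

```python
# Minimal sketch of Best-of-N sampling (hypothetical helpers, not from the paper).
# Answering one prompt costs n generations plus n reward-model evaluations.

def best_of_n(prompt, n, generate_response, reward_model):
    """Sample n candidate responses and return the one the reward model scores highest."""
    candidates = [generate_response(prompt) for _ in range(n)]   # n LLM calls
    scores = [reward_model(prompt, c) for c in candidates]       # n reward-model calls
    best_index = max(range(n), key=lambda i: scores[i])
    return candidates[best_index]
```

BOND's aim is to train the policy so that a single call to the generator produces a response of the quality this loop would select.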

What's the solution?

BOND addresses this issue with a new RLHF algorithm that mimics Best-of-N sampling without generating and scoring multiple candidates at inference time. Instead, it uses distribution matching during training: the policy's output distribution is pushed toward the distribution that Best-of-N sampling would produce. The matching objective is the Jeffreys divergence, a weighted combination of the forward KL (which encourages the policy to cover all the outputs Best-of-N could return) and the backward KL (which encourages it to concentrate on the most desirable ones). After training, a single sample from the model behaves like a Best-of-N selection, making inference much more efficient.
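Written out, the matching objective described above is a Jeffreys divergence between the trained policy and the Best-of-N distribution. The sketch below uses our own notation; in particular, the weighting convention with the coefficient beta is an assumption, not copied from the paper.

```latex
% Jeffreys divergence between the policy \pi_\theta and the Best-of-N
% distribution \pi_{BoN}: a weighted sum of forward and backward KL.
% The \beta weighting convention is our assumption, not the paper's notation.
J_\beta(\pi_\theta)
  = (1-\beta)\,\mathrm{KL}\!\left(\pi_{\mathrm{BoN}} \,\|\, \pi_\theta\right)   % forward KL: mode-covering
  + \beta\,\mathrm{KL}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{BoN}}\right),      % backward KL: mode-seeking
  \qquad \beta \in [0, 1].
```

Minimizing the forward term makes the policy cover everything Best-of-N would produce, while the backward term pushes probability mass onto the highest-reward outputs; the weight trades off the two behaviors.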

Why it matters?

This research is significant because it enhances the efficiency of LLMs, enabling them to provide high-quality answers quickly and with less computational cost. This improvement can lead to better performance in various applications, such as chatbots, virtual assistants, and any system that relies on fast and accurate text generation.

Abstract

Reinforcement learning from human feedback (RLHF) is a key driver of quality and safety in state-of-the-art large language models. Yet, a surprisingly simple and strong inference-time strategy is Best-of-N sampling that selects the best generation among N candidates. In this paper, we propose Best-of-N Distillation (BOND), a novel RLHF algorithm that seeks to emulate Best-of-N but without its significant computational overhead at inference time. Specifically, BOND is a distribution matching algorithm that forces the distribution of generations from the policy to get closer to the Best-of-N distribution. We use the Jeffreys divergence (a linear combination of forward and backward KL) to balance between mode-covering and mode-seeking behavior, and derive an iterative formulation that utilizes a moving anchor for efficiency. We demonstrate the effectiveness of our approach and several design choices through experiments on abstractive summarization and Gemma models. Aligning Gemma policies with BOND outperforms other RLHF algorithms by improving results on several benchmarks.
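For reference, the Best-of-N distribution that BOND distills toward has a simple closed form. The expression below is the standard order-statistics identity written in our own notation, assuming reward ties under the reference policy have probability zero; see the paper for the exact treatment.

```latex
% Distribution of the best of N i.i.d. samples from a reference policy \pi_{ref},
% ranked by a reward r, assuming no ties. Standard order-statistics identity,
% written in our notation rather than the paper's.
\pi_{\mathrm{BoN}}(y \mid x)
  = N\,\pi_{\mathrm{ref}}(y \mid x)\,\big[F(y \mid x)\big]^{N-1},
\qquad
F(y \mid x) = \Pr_{y' \sim \pi_{\mathrm{ref}}(\cdot \mid x)}\!\big[r(x, y') \le r(x, y)\big].
```

Intuitively, generations at a higher reward quantile F are boosted by the factor F^{N-1}; this reweighted distribution is what BOND trains the policy to reproduce with a single sample.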