Asynchronous RLHF: Faster and More Efficient Off-Policy RL for Language Models

Michael Noukhovitch, Shengyi Huang, Sophie Xhonneux, Arian Hosseini, Rishabh Agarwal, Aaron Courville

2024-10-25

Summary

This paper presents Asynchronous RLHF, a method for training language models faster and more efficiently by decoupling the generation of training samples from the learning updates that use them.

What's the problem?

The standard approach to Reinforcement Learning from Human Feedback (RLHF) is synchronous: at each step the model must first generate a batch of outputs, then learn from them, before it can generate again. Because generation and learning run in lockstep, each phase sits idle while the other runs, which slows training and wastes computational resources.
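To make the bottleneck concrete, here is a minimal sketch of a synchronous RLHF loop, not the authors' implementation. The functions `generate`, `score`, and `update` are hypothetical stand-ins for sampling from the policy, labelling with a reward model, and taking a gradient step; the sleeps only simulate their relative costs.

```python
import time

# Hypothetical stand-ins for the three phases of a synchronous RLHF step.
def generate(policy, prompts):
    time.sleep(1.0)              # generation usually dominates wall-clock time
    return [f"response to {p}" for p in prompts]

def score(responses):
    time.sleep(0.2)              # reward-model labelling
    return [len(r) for r in responses]

def update(policy, responses, rewards):
    time.sleep(0.5)              # gradient step on the policy
    return policy + 1            # pretend the policy changed

policy = 0
for step in range(3):
    # Synchronous RLHF: each phase waits for the previous one, so the
    # training hardware idles during generation and vice versa.
    responses = generate(policy, ["prompt"] * 4)
    rewards = score(responses)
    policy = update(policy, responses, rewards)
```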

What's the solution?

The authors propose separating the generation of new samples from the learning process: the model keeps generating data while it simultaneously trains on samples produced by slightly older versions of itself, which speeds up training. They study how much of this 'off-policy' data (data generated by previous versions of the model) training can tolerate without hurting performance. They find that online DPO is the most robust of the algorithms they tested, and that this robustness grows with the size of the policy model.
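As a rough illustration of the idea (a sketch under assumed names, not the authors' code), the producer/consumer loop below runs generation in a background thread while the trainer consumes batches from a queue. Because the generator may still be using weights from one or more updates ago, the trainer learns from slightly stale samples, which is exactly the "online but off-policy" regime the paper studies.

```python
import queue
import threading
import time

sample_queue = queue.Queue(maxsize=8)   # holds (policy_version, batch) pairs
policy_version = 0                      # shared counter standing in for the weights
stop = threading.Event()

def generate_batch(version):
    time.sleep(1.0)                     # pretend sampling from the LLM is slow
    return [f"sample from policy v{version}"] * 4

def generator_loop():
    # Producer: generates with whatever policy version it last saw, so its
    # samples can lag the trainer by one or more updates (off-policy data).
    while not stop.is_set():
        version = policy_version
        sample_queue.put((version, generate_batch(version)))

def train_step(batch):
    time.sleep(0.5)                     # stand-in for an online DPO update

threading.Thread(target=generator_loop, daemon=True).start()

for step in range(5):
    version, batch = sample_queue.get()
    staleness = policy_version - version
    print(f"step {step}: training on samples {staleness} update(s) old")
    train_step(batch)                   # the learner never waits for generation
    policy_version += 1                 # the generator picks this up on its next batch

stop.set()
```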

Why it matters?

This research is important because it significantly reduces the time needed to train language models while maintaining their performance; in the paper's experiments, asynchronous training fine-tuned LLaMA 3.1 8B on an instruction-following task about 40% faster than a synchronous run while matching its final results. By improving the efficiency of RLHF, this method can help create better AI systems faster, benefiting applications like chatbots, virtual assistants, and other systems that rely on understanding and generating human-like text.

Abstract

The dominant paradigm for RLHF is online and on-policy RL: synchronously generating from the large language model (LLM) policy, labelling with a reward model, and learning using feedback on the LLM's own outputs. While performant, this paradigm is computationally inefficient. Inspired by classical deep RL literature, we propose separating generation and learning in RLHF. This enables asynchronous generation of new samples while simultaneously training on old samples, leading to faster training and more compute-optimal scaling. However, asynchronous training relies on an underexplored regime, online but off-policy RLHF: learning on samples from previous iterations of our model. To understand the challenges in this regime, we investigate a fundamental question: how much off-policyness can we tolerate for asynchronous training to speed up learning but maintain performance? Among several RLHF algorithms we tested, we find that online DPO is most robust to off-policy data, and robustness increases with the scale of the policy model. We study further compute optimizations for asynchronous RLHF but find that they come at a performance cost, giving rise to a trade-off. Finally, we verify the scalability of asynchronous RLHF by training LLaMA 3.1 8B on an instruction-following task 40% faster than a synchronous run while matching final performance.
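For readers who want to see what "online DPO on off-policy samples" means mechanically, here is a minimal, hedged sketch of the standard DPO objective applied to a preference pair. The tensors are random stand-ins for summed token log-probabilities under the current policy and a frozen reference model; in asynchronous RLHF, the (chosen, rejected) pair would have been generated and labelled under a slightly older policy.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO objective: push the policy to prefer the chosen response
    over the rejected one, measured relative to a frozen reference model."""
    chosen_logratio = policy_chosen_logps - ref_chosen_logps
    rejected_logratio = policy_rejected_logps - ref_rejected_logps
    return -F.logsigmoid(beta * (chosen_logratio - rejected_logratio)).mean()

# The preference pair comes from a slightly stale policy (off-policy data),
# but the loss is computed with the *current* policy's log-probabilities,
# which is what makes this "online but off-policy" learning.
batch = 4
loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                torch.randn(batch), torch.randn(batch))
print(loss.item())
```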