
Accelerating Direct Preference Optimization with Prefix Sharing

Franklin Wang, Sumanth Hegde

2024-10-30


Summary

This paper introduces prefix sharing, a technique that speeds up training large language models (LLMs) with Direct Preference Optimization (DPO) by processing the chosen and rejected responses of each preference pair as a single sequence with a shared prompt prefix, eliminating redundant computation on the prompt.

What's the problem?

Paired preference methods like DPO train an LLM on examples that pair one prompt with a preferred ("chosen") response and a dispreferred ("rejected") response. Standard implementations run the model separately on prompt+chosen and prompt+rejected, so the shared prompt is processed twice per example. When prompts are long relative to the responses, as in many popular preference datasets, this duplicated work wastes a substantial share of training compute.

What's the solution?

The authors propose prefix sharing: the chosen and rejected responses are concatenated after a single copy of the shared prompt, so the prompt is processed only once per pair. A custom block-sparse attention mask lets each response attend to the prompt and to its own earlier tokens while preventing the two responses from attending to each other, so the training signal is unchanged and convergence is unaffected. This yields 1.1-1.5× higher training throughput on popular DPO datasets, and 1.3-1.6× when combined with sequence packing; a minimal sketch of the masking idea is shown below.
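To make the masking idea concrete, here is a minimal PyTorch sketch of a mask for one example laid out as [prompt | chosen | rejected]. This is an illustration, not the authors' open-sourced implementation: the function name and the dense boolean mask are assumptions chosen for readability, whereas an efficient version would rely on a block-sparse attention kernel.

```python
import torch

def prefix_sharing_mask(prompt_len: int, chosen_len: int, rejected_len: int) -> torch.Tensor:
    """Boolean attention mask of shape (total, total) for the layout
    [prompt | chosen | rejected]; mask[i, j] is True when query i may attend to key j.
    Hypothetical helper for illustration only."""
    total = prompt_len + chosen_len + rejected_len
    # Start from an ordinary causal (lower-triangular) mask.
    mask = torch.tril(torch.ones(total, total)).bool()
    # Forbid the rejected response from attending to the chosen response,
    # so the two responses share only the prompt, not each other.
    chosen_start = prompt_len
    rejected_start = prompt_len + chosen_len
    mask[rejected_start:, chosen_start:rejected_start] = False
    return mask

# Tiny example: 4 prompt tokens, 2 chosen tokens, 3 rejected tokens.
print(prefix_sharing_mask(4, 2, 3).int())
```

Fed to an attention implementation that accepts boolean masks, this reproduces ordinary causal attention for each (prompt, response) pair while computing the prompt portion only once.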

Why it matters?

This research matters because preference-based fine-tuning has become a standard step in aligning LLMs with human preferences, and its compute cost limits who can use it and at what scale. Because prefix sharing applies not only to DPO but to other paired preference tuning methods, and the authors open-source their code, the speedups make this kind of alignment training more accessible across a wider range of applications and model sizes.

Abstract

Offline paired preference optimization algorithms have become a popular approach for fine-tuning on preference data, outperforming traditional supervised fine-tuning in various tasks. However, traditional implementations often involve redundant computations, especially for tasks with long shared prompts. We introduce prefix sharing for preference tuning, a novel technique that processes chosen and rejected responses as one sequence with a shared prefix. To prevent cross-response contamination, we use a custom block-sparse attention mask. Our method achieves 1.1-1.5× improvement in training throughput on popular DPO datasets, without any effect on convergence. When combined with sequence packing, we observe consistent 1.3-1.6× speedups, benefiting even datasets with smaller sequence lengths. While we focus on Direct Preference Optimization (DPO), our approach is applicable to other paired preference tuning methods. By enhancing computational efficiency, our work contributes to making preference-based fine-tuning more accessible for a wider range of applications and model sizes. We open-source our code at https://github.com/frankxwang/dpo-prefix-sharing.
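As a rough illustration of where the savings come from (the lengths below are made up, not taken from the paper): standard paired DPO forwards the shared prompt once per response, while prefix sharing forwards it once per pair.

```python
# Illustrative token-count arithmetic with hypothetical lengths.
prompt_len, chosen_len, rejected_len = 1024, 256, 256

# Standard paired DPO: the prompt is duplicated, once per response.
standard_tokens = 2 * prompt_len + chosen_len + rejected_len   # 2560
# Prefix sharing: the prompt is processed a single time.
shared_tokens = prompt_len + chosen_len + rejected_len         # 1536

print(round(standard_tokens / shared_tokens, 2))  # 1.67 -> standard processes ~1.67x the tokens here
```

The measured throughput gains reported in the paper (1.1-1.5×, or 1.3-1.6× with sequence packing) depend on how long the shared prompt is relative to the responses and on the rest of the training pipeline, so they differ from this raw token ratio.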