Bootstrapping Language Models with DPO Implicit Rewards

Changyu Chen, Zichen Liu, Chao Du, Tianyu Pang, Qian Liu, Arunesh Sinha, Pradeep Varakantham, Min Lin

2024-06-19

Summary

This paper introduces a method called DICE (self-alignment with DPO Implicit Rewards) that improves how large language models (LLMs) align with human preferences. It reuses the implicit reward model produced by Direct Preference Optimization (DPO) training to further align the model, without needing any external feedback.
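For reference, DPO training yields an implicit reward that can be read directly from the policy and the reference model; this is the standard DPO relation, with generic notation rather than anything taken from the paper's code. The prompt-dependent partition term cancels when comparing responses to the same prompt, so this quantity can be used directly to rank candidate responses.

```latex
% Implicit reward recovered from a DPO-trained policy \pi_\theta,
% relative to the reference policy \pi_{\mathrm{ref}}, with scale \beta.
r_\theta(x, y) = \beta \log \frac{\pi_\theta(y \mid x)}{\pi_{\mathrm{ref}}(y \mid x)}
```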

What's the problem?

Aligning LLMs with human values and preferences is important for making them more useful and effective. Traditional methods, such as reinforcement learning from human feedback (RLHF), are complex and data-hungry: they require collecting human preference data, training a separate reward model, and then optimizing the LLM against it, which makes the process slow and costly.

What's the solution?

The authors propose using the implicit rewards that a DPO-trained model provides to score the model's own responses and build a new preference dataset, which is then used in further rounds of DPO to refine the model's alignment (a sketch of this loop is shown below). They also introduce refinements that improve the quality of the bootstrapped dataset and reduce biases toward longer responses. The resulting method, DICE, achieves a 27.55% length-controlled win rate against GPT-4 Turbo on AlpacaEval 2, outperforming Gemini Pro with only an 8B-parameter model and no external feedback.
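A minimal sketch of this bootstrapping loop, in Python, is shown below. The policy and reference-model interfaces (generate, logprob), the length-penalty coefficient alpha, and the pair-selection rule are illustrative assumptions rather than the paper's actual implementation; they are meant only to make the data-construction step concrete.

```python
def implicit_reward(policy_logprob: float, ref_logprob: float, beta: float = 0.1) -> float:
    """Standard DPO implicit reward: beta * (log pi_theta(y|x) - log pi_ref(y|x)),
    with log-probabilities summed over the response tokens."""
    return beta * (policy_logprob - ref_logprob)


def build_preference_pairs(prompts, policy, ref_model, n_samples=8, alpha=0.01):
    """Hypothetical helper: sample candidate responses from the current policy,
    score them with the implicit reward (minus a simple length penalty as a crude
    form of length debiasing), and keep the highest/lowest-scoring responses as
    (chosen, rejected) pairs for the next DPO round."""
    pairs = []
    for prompt in prompts:
        candidates = policy.generate(prompt, num_samples=n_samples)  # assumed API
        scored = []
        for response in candidates:
            reward = implicit_reward(
                policy.logprob(response, prompt),     # assumed API
                ref_model.logprob(response, prompt),  # assumed API
            )
            reward -= alpha * len(response)  # length penalty (illustrative, not the paper's exact scheme)
            scored.append((reward, response))
        scored.sort(key=lambda item: item[0], reverse=True)
        chosen, rejected = scored[0][1], scored[-1][1]
        pairs.append({"prompt": prompt, "chosen": chosen, "rejected": rejected})
    return pairs

# `pairs` would then serve as the preference dataset for another round of DPO training.
```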

Why it matters?

This research is significant because it simplifies the process of aligning LLMs with human preferences, making it faster and more efficient. By using implicit rewards for self-improvement, DICE can help develop better AI systems that understand and respond to human needs more accurately. This advancement could lead to more effective applications in various fields, including customer service, education, and content creation.

Abstract

Human alignment in large language models (LLMs) is an active area of research. A recent groundbreaking work, direct preference optimization (DPO), has greatly simplified the process from past work in reinforcement learning from human feedback (RLHF) by bypassing the reward learning stage in RLHF. DPO, after training, provides an implicit reward model. In this work, we make a novel observation that this implicit reward model can by itself be used in a bootstrapping fashion to further align the LLM. Our approach is to use the rewards from a current LLM to construct a preference dataset, which is then used in subsequent DPO rounds. We incorporate refinements that debias the length of the responses and improve the quality of the preference dataset to further improve our approach. Our approach, named self-alignment with DPO ImpliCit rEwards (DICE), shows great improvements in alignment and achieves superior performance to Gemini Pro on AlpacaEval 2, reaching a 27.55% length-controlled win rate against GPT-4 Turbo, but with only 8B parameters and no external feedback. Our code is available at https://github.com/sail-sg/dice.
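For context, each subsequent round in this bootstrapping procedure optimizes a DPO objective on the newly constructed pairs; the loss below is the standard DPO formulation, not anything specific to DICE.

```latex
% Standard DPO loss on preference pairs (x, y_w, y_l), where y_w is the chosen
% and y_l the rejected response, \pi_\theta the policy being trained,
% \pi_{\mathrm{ref}} the reference model, \beta the scale, and \sigma the sigmoid.
\mathcal{L}_{\mathrm{DPO}}(\theta) =
  -\,\mathbb{E}_{(x,\, y_w,\, y_l)}\left[
     \log \sigma\!\left(
       \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
       \;-\;
       \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
     \right)
   \right]
```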