Self-Boosting Large Language Models with Synthetic Preference Data

Qingxiu Dong, Li Dong, Xingxing Zhang, Zhifang Sui, Furu Wei

2024-10-10

Summary

This paper introduces SynPO, a new method that helps large language models (LLMs) improve their responses by using synthetic preference data instead of relying on human-generated data.

What's the problem?

Collecting high-quality preference data, which shows how humans want models to respond, is expensive and time-consuming. This makes it difficult to continuously improve LLMs, as they need this data to learn how to generate better, more helpful responses.

What's the solution?

SynPO sets up an iterative loop in which the model generates its own prompts and then refines its own responses. A self-prompt generator creates a diverse set of prompts, and a response improver enhances the model's answers, so the LLM can learn from its own outputs without needing extensive human feedback. After four such iterations, models like Llama3-8B and Mistral-7B showed significant improvements in following instructions and in overall performance on a variety of tasks.
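
To make the loop concrete, here is a minimal Python sketch of one self-boosting round. The names used here (prompt_generator, response_improver, preference_optimizer, num_prompts) are illustrative assumptions rather than the paper's actual interfaces, and treating the improved answer as the preferred response in each synthetic pair is an inference from the method description, not a confirmed implementation detail.

# Minimal sketch of one SynPO-style self-boosting round (hypothetical names,
# not the authors' code).

def synpo_round(model, prompt_generator, response_improver, preference_optimizer,
                num_prompts=1000):
    """Build synthetic preference pairs from the model's own outputs, then update it."""
    pairs = []
    for _ in range(num_prompts):
        # 1. The self-prompt generator synthesizes a new instruction.
        prompt = prompt_generator.generate()

        # 2. The current model drafts an answer; this is treated as the weaker response.
        draft = model.generate(prompt)

        # 3. The response improver refines the draft; the refinement is treated as preferred.
        improved = response_improver.refine(prompt, draft)

        pairs.append({"prompt": prompt, "chosen": improved, "rejected": draft})

    # 4. Align the model on the synthetic pairs (e.g., with a DPO-style preference objective).
    return preference_optimizer.train(model, pairs)

# The paper reports results after four iterations, which would correspond to:
# for _ in range(4):
#     model = synpo_round(model, prompt_generator, response_improver, preference_optimizer)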

Why it matters?

This research matters because it gives AI models a way to become better at understanding and responding to human preferences without relying heavily on human-generated data. By using synthetic data, SynPO makes the training process faster and more efficient, ultimately leading to smarter AI that can assist users more effectively in real-world applications.

Abstract

Through alignment with human preferences, Large Language Models (LLMs) have advanced significantly in generating honest, harmless, and helpful responses. However, collecting high-quality preference data is a resource-intensive and creativity-demanding process, especially for the continual improvement of LLMs. We introduce SynPO, a self-boosting paradigm that leverages synthetic preference data for model alignment. SynPO employs an iterative mechanism wherein a self-prompt generator creates diverse prompts, and a response improver refines model responses progressively. This approach trains LLMs to autonomously learn the generative rewards for their own outputs and eliminates the need for large-scale annotation of prompts and human preferences. After four SynPO iterations, Llama3-8B and Mistral-7B show significant enhancements in instruction-following abilities, achieving over 22.1% win rate improvements on AlpacaEval 2.0 and ArenaHard. Simultaneously, SynPO improves the general performance of LLMs on various tasks, validated by a 3.2 to 5.0 average score increase on the well-recognized Open LLM leaderboard.