
Skywork-Reward: Bag of Tricks for Reward Modeling in LLMs

Chris Yuhao Liu, Liang Zeng, Jiacai Liu, Rui Yan, Jujie He, Chaojie Wang, Shuicheng Yan, Yang Liu, Yahui Zhou

2024-10-25


Summary

This paper introduces Skywork-Reward, a set of data-centric techniques for improving reward models, the components that teach large language models (LLMs) which responses users prefer, with a focus on curating a high-quality preference dataset for training.

What's the problem?

Many existing preference datasets used to train reward models are large but of uneven quality, which leads to models that misjudge which responses users actually prefer. At the same time, collecting and annotating new preference data is expensive and time-consuming. Together, these issues make it difficult to build reward models that accurately reflect what users want.

What's the solution?

The authors introduce Skywork-Reward, an approach that carefully selects and filters open-source data to produce a smaller but high-quality dataset of only 80,000 preference pairs. Using this curated dataset, they trained two reward models, Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B, whose techniques earned top rankings on the RewardBench leaderboard, demonstrating that a smaller, carefully curated dataset can outperform much larger but noisier ones.
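
To make "training on preference pairs" concrete, here is a minimal sketch of how a reward model is typically fit to chosen/rejected pairs with a Bradley-Terry-style pairwise loss. This is an illustration under stated assumptions, not the paper's implementation: a tiny linear scoring head and synthetic embeddings stand in for an LLM backbone such as Llama-3.1-8B, and the exact Skywork-Reward training recipe may differ.

```python
# Sketch: pairwise reward-model training on preference pairs.
# Assumptions (not from the paper): a toy linear scoring head over pooled
# embeddings and synthetic features replace a real LLM backbone; the
# Bradley-Terry loss below is the standard pairwise objective.
import torch
import torch.nn as nn
import torch.nn.functional as F

class RewardHead(nn.Module):
    """Maps a pooled response representation to a scalar reward."""
    def __init__(self, hidden_size: int):
        super().__init__()
        self.score = nn.Linear(hidden_size, 1)

    def forward(self, pooled: torch.Tensor) -> torch.Tensor:
        return self.score(pooled).squeeze(-1)  # (batch,) scalar rewards

def preference_loss(r_chosen: torch.Tensor, r_rejected: torch.Tensor) -> torch.Tensor:
    # Bradley-Terry pairwise loss: -log sigmoid(r_chosen - r_rejected)
    return -F.logsigmoid(r_chosen - r_rejected).mean()

torch.manual_seed(0)
hidden = 64
model = RewardHead(hidden)
opt = torch.optim.AdamW(model.parameters(), lr=1e-2)

# Toy data: a hidden "preference direction" makes chosen responses
# systematically better than rejected ones, so the head has something to learn.
w_true = torch.randn(hidden)

for step in range(200):
    base = torch.randn(128, hidden)
    chosen = base + 0.3 * w_true      # pooled reps of preferred responses
    rejected = base - 0.3 * w_true    # pooled reps of dispreferred responses
    loss = preference_loss(model(chosen), model(rejected))
    opt.zero_grad()
    loss.backward()
    opt.step()

print(f"final pairwise loss: {loss.item():.3f}")
```

The key takeaway mirrors the paper's argument: the quality of the (chosen, rejected) pairs fed into this loss matters more than their sheer quantity.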

Why it matters?

This research is significant because it shows that focusing on the quality of training data can be more effective than simply using larger datasets. By offering a more efficient way to train reward models on high-quality preference data, Skywork-Reward can help improve AI systems in applications such as chatbots and recommendation systems, making them more responsive to user needs.

Abstract

In this report, we introduce a collection of methods to enhance reward modeling for LLMs, focusing specifically on data-centric techniques. We propose effective data selection and filtering strategies for curating high-quality open-source preference datasets, culminating in the Skywork-Reward data collection, which contains only 80K preference pairs -- significantly smaller than existing datasets. Using this curated dataset, we developed the Skywork-Reward model series -- Skywork-Reward-Gemma-27B and Skywork-Reward-Llama-3.1-8B -- with the former currently holding the top position on the RewardBench leaderboard. Notably, our techniques and datasets have directly enhanced the performance of many top-ranked models on RewardBench, highlighting the practical impact of our contributions in real-world preference learning applications.
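
The abstract highlights data selection and filtering as the core contribution. The report does not publish a single pipeline in this summary, so the sketch below is only illustrative of what such curation generally looks like, deduplicating pairs and dropping degenerate or low-margin examples, using hypothetical field names (prompt, chosen, rejected, chosen_score, rejected_score).

```python
# Illustrative preference-pair curation sketch; not the paper's exact pipeline.
from typing import Iterable

def curate_pairs(pairs: Iterable[dict], min_margin: float = 0.5) -> list[dict]:
    """Keep unique, well-separated preference pairs."""
    seen = set()
    kept = []
    for p in pairs:
        key = (p["prompt"], p["chosen"], p["rejected"])
        if key in seen:
            continue                      # drop exact duplicates
        seen.add(key)
        if p["chosen"] == p["rejected"]:
            continue                      # drop degenerate pairs
        # Keep only pairs where an external quality score clearly prefers "chosen".
        if p.get("chosen_score", 1.0) - p.get("rejected_score", 0.0) < min_margin:
            continue
        kept.append(p)
    return kept

example = [
    {"prompt": "Explain backprop.", "chosen": "Clear answer...", "rejected": "Vague answer...",
     "chosen_score": 0.9, "rejected_score": 0.2},
]
print(len(curate_pairs(example)))  # -> 1
```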