Skywork-Reward-V2: Scaling Preference Data Curation via Human-AI Synergy
Chris Yuhao Liu, Liang Zeng, Yuzhen Xiao, Jujie He, Jiacai Liu, Chaojie Wang, Rui Yan, Wei Shen, Fuxiang Zhang, Jiacheng Xu, Yang Liu, Yahui Zhou
2025-07-04
Summary
This paper introduces Skywork-Reward-V2, a suite of AI reward models trained on a large, carefully curated dataset of human preferences. The models were built by combining human judgment with AI assistance, improving how accurately they capture and score human preferences.
What's the problem?
The problem is that AI models often struggle to learn human preferences accurately, because collecting and labeling preference data is time-consuming, expensive, and often inconsistent when done by humans alone or by AI alone.
What's the solution?
The researchers developed a human-AI synergy pipeline in which human annotators and AI models jointly curate and verify preference data at scale. This yields higher-quality training data, which helps the reward models achieve state-of-the-art performance across many benchmarks (see the sketch below for how such data is typically used).
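Once curated, preference pairs like these are commonly used to train a reward model with a pairwise objective. The sketch below is a minimal illustration only, assuming a PyTorch-style model with a scalar reward head and the widely used Bradley-Terry loss; the names `reward_model`, `prompt`, `chosen_response`, and `rejected_response` are hypothetical placeholders, not the authors' actual training code.

```python
import torch
import torch.nn.functional as F

def bradley_terry_loss(chosen_rewards: torch.Tensor,
                       rejected_rewards: torch.Tensor) -> torch.Tensor:
    """Pairwise preference loss: push r(chosen) above r(rejected).

    Both inputs are shape (batch,) scalar rewards, e.g. produced by a
    reward head on top of a language model's final hidden state.
    """
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch.
    # Minimizing this increases the margin between preferred and
    # rejected responses.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Hypothetical usage with a scalar-output reward model:
# r_c = reward_model(prompt, chosen_response)    # shape (batch,)
# r_r = reward_model(prompt, rejected_response)  # shape (batch,)
# loss = bradley_terry_loss(r_c, r_r)
# loss.backward()
```

The quality of the curated pairs matters precisely because this loss only ever compares responses within a pair: mislabeled or inconsistent pairs directly teach the model the wrong preference ordering.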
Why it matters?
This matters because reward models guide AI behavior. Improving their accuracy and generalization through better preference data makes AI systems more closely aligned with human intent, and therefore safer and more useful.
Abstract
Skywork-Reward-V2, a suite of reward models trained on a large-scale, high-quality preference dataset, achieves state-of-the-art performance across various benchmarks by leveraging human-AI synergistic curation.