Skywork-VL Reward: An Effective Reward Model for Multimodal Understanding and Reasoning
Xiaokun Wang, Chris, Jiangbo Pei, Wei Shen, Yi Peng, Yunzhuo Hao, Weijie Qiu, Ai Jian, Tianyidan Xie, Xuchen Song, Yang Liu, Yahui Zhou
2025-05-13
Summary
This paper introduces Skywork-VL Reward, a reward model that judges the quality of other AI models' answers on tasks combining images and text, scoring each answer by how well it matches human preferences.
What's the problem?
The problem is that it's hard to measure how good AI models are at tasks that require understanding images and language together. Most existing reward models were built for text alone, so there has been no reliable way to score which multimodal answers people would actually prefer.
What's the solution?
The researchers built Skywork-VL Reward on top of the Qwen2.5-VL-7B-Instruct vision-language model, adding a head that outputs a single quality score, and trained it on a large-scale dataset of human preference comparisons. The resulting model accurately scores how well AI systems handle tasks that mix images and text, reaching state-of-the-art results on multimodal reward benchmarks such as VL-RewardBench. A simplified sketch of this setup follows below.
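The paper does not include code here, but the general recipe it describes, a scalar reward head on a vision-language backbone trained with a pairwise preference loss, is standard. Below is a minimal, illustrative PyTorch sketch, not the authors' released implementation: the class names VLRewardModel and preference_loss, the last-token pooling choice, and the exact Hugging Face loading call and forward signature for Qwen2.5-VL are all assumptions.

```python
# Minimal sketch of a multimodal reward model; names and API details are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F
from transformers import AutoModel  # assumption: AutoModel resolves Qwen2.5-VL; the exact class may differ


class VLRewardModel(nn.Module):
    """A scalar reward head on top of a vision-language backbone (illustrative)."""

    def __init__(self, base_model_name: str = "Qwen/Qwen2.5-VL-7B-Instruct"):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_model_name)
        # Map the final hidden state to one scalar reward.
        # (Assumption: the config exposes hidden_size at this path.)
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids, attention_mask, pixel_values=None, **kwargs):
        out = self.backbone(
            input_ids=input_ids,
            attention_mask=attention_mask,
            pixel_values=pixel_values,  # image inputs; exact kwargs depend on the processor
            output_hidden_states=True,
            **kwargs,
        )
        hidden = out.hidden_states[-1]  # (batch, seq_len, hidden_size)
        # Pool at the last non-padding token of each sequence (assumes right padding).
        last_idx = attention_mask.sum(dim=1) - 1
        batch_idx = torch.arange(hidden.size(0), device=hidden.device)
        pooled = hidden[batch_idx, last_idx]
        return self.reward_head(pooled).squeeze(-1)  # (batch,) scalar rewards


def preference_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry pairwise loss: push the preferred answer's reward above the rejected one's."""
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()
```

In training, each preference pair (an image-plus-question prompt with a human-preferred answer and a rejected answer) is scored by the same model, and the loss raises the preferred answer's reward relative to the rejected one's. At inference, the scalar output can rank any set of candidate responses.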
Why it matters?
This matters because reward models like this one are used to guide and align other AI systems, for example through reinforcement learning from human feedback. Better reward models therefore lead to smarter, more helpful, and more trustworthy AI for applications such as virtual assistants, education, and creative tools.
Abstract
Skywork-VL Reward is a multimodal reward model built on the Qwen2.5-VL-7B-Instruct architecture and trained on a large-scale preference dataset; it achieves state-of-the-art performance in evaluating multimodal understanding and reasoning.