VLRewardBench: A Challenging Benchmark for Vision-Language Generative Reward Models
Lei Li, Yuancheng Wei, Zhihui Xie, Xuqing Yang, Yifan Song, Peiyi Wang, Chenxin An, Tianyu Liu, Sujian Li, Bill Yuchen Lin, Lingpeng Kong, Qi Liu
2024-11-27

Summary
This paper introduces VLRewardBench, a new benchmark designed to evaluate Vision-Language Generative Reward Models (VL-GenRMs), the models that judge the quality of responses produced by AI systems working with both text and images.
What's the problem?
Evaluating VL-GenRMs is challenging because current methods often rely on biased AI-generated labels and do not effectively test the models' capabilities. This can lead to inaccurate assessments of how well these models perform in real-world scenarios, making it hard to improve them.
What's the solution?
VLRewardBench provides a comprehensive evaluation framework that covers a variety of tasks, including general multimodal queries, visual hallucination detection, and complex reasoning. The authors created 1,250 high-quality examples through a combination of AI-assisted annotation and human verification. They evaluated 16 leading models on the benchmark and found that even advanced models like GPT-4o struggled, reaching only 65.4% accuracy, which demonstrates how effectively the benchmark challenges current models (see the sketch below for how such judgment accuracy is computed).
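To make the evaluation setup concrete, here is a minimal sketch of how a preference benchmark of this kind scores a reward model: each example pairs an image-grounded query with two candidate responses, and the judge must pick the human-preferred one. The `judge` function and the example fields below are hypothetical stand-ins, not the paper's actual pipeline or data format.

```python
# Minimal sketch of scoring a VL-GenRM on pairwise preference examples.
# `judge` is a hypothetical placeholder for an actual vision-language model call.

from dataclasses import dataclass

@dataclass
class PreferenceExample:
    image_path: str   # image the query is grounded in
    query: str        # user instruction or question
    response_a: str   # candidate response A
    response_b: str   # candidate response B
    preferred: str    # human-verified label: "A" or "B"

def judge(example: PreferenceExample) -> str:
    """Hypothetical VL-GenRM call: prompt the model with the image, the query,
    and both responses, then parse its verdict ("A" or "B")."""
    return "A"  # placeholder; a real judge would query a vision-language model here

def judgment_accuracy(examples: list[PreferenceExample]) -> float:
    """Fraction of pairs on which the judge agrees with the human preference."""
    if not examples:
        return 0.0
    correct = sum(judge(ex) == ex.preferred for ex in examples)
    return correct / len(examples)

if __name__ == "__main__":
    demo = [
        PreferenceExample("img_001.jpg", "How many cups are on the table?",
                          "There are three cups.", "There are five cups.", "A"),
        PreferenceExample("img_002.jpg", "Describe the chart's trend.",
                          "Sales decline after 2020.", "Sales rise steadily.", "B"),
    ]
    print(f"judgment accuracy: {judgment_accuracy(demo):.1%}")
```

A model that judges at chance level scores about 50% on such pairs, which is why accuracy near random guessing (as reported for some open-source models) signals that the benchmark genuinely probes model limitations.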
Why it matters?
This research is important because it establishes a new standard for evaluating VL-GenRMs, helping researchers identify strengths and weaknesses in these models. By providing a rigorous testing environment, VLRewardBench can drive improvements in AI systems that integrate text and images, leading to better performance in applications like virtual assistants, content generation, and interactive media.
Abstract
Vision-language generative reward models (VL-GenRMs) play a crucial role in aligning and evaluating multimodal AI systems, yet their own evaluation remains under-explored. Current assessment methods primarily rely on AI-annotated preference labels from traditional VL tasks, which can introduce biases and often fail to effectively challenge state-of-the-art models. To address these limitations, we introduce VL-RewardBench, a comprehensive benchmark spanning general multimodal queries, visual hallucination detection, and complex reasoning tasks. Through our AI-assisted annotation pipeline combining sample selection with human verification, we curate 1,250 high-quality examples specifically designed to probe model limitations. Comprehensive evaluation across 16 leading large vision-language models demonstrates VL-RewardBench's effectiveness as a challenging testbed, where even GPT-4o achieves only 65.4% accuracy, and state-of-the-art open-source models such as Qwen2-VL-72B struggle to surpass random guessing. Importantly, performance on VL-RewardBench strongly correlates (Pearson's r > 0.9) with MMMU-Pro accuracy using Best-of-N sampling with VL-GenRMs. Analysis experiments uncover three critical insights for improving VL-GenRMs: (i) models predominantly fail at basic visual perception tasks rather than reasoning tasks; (ii) inference-time scaling benefits vary dramatically by model capacity; and (iii) training VL-GenRMs to learn to judge substantially boosts judgment capability (+14.7% accuracy for a 7B VL-GenRM). We believe VL-RewardBench along with the experimental insights will become a valuable resource for advancing VL-GenRMs.
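The abstract's correlation result concerns Best-of-N sampling, where a policy model draws N candidate answers and the reward model's scores decide which one to keep. The sketch below illustrates that selection loop under stated assumptions; `generate` and `score` are hypothetical placeholders for the actual model calls, not the paper's implementation.

```python
# Minimal sketch of Best-of-N sampling with a generative reward model as scorer.
# `generate` and `score` are hypothetical placeholders for real model calls.

import random

def generate(image_path: str, query: str) -> str:
    """Hypothetical policy-model call returning one candidate answer."""
    return random.choice(["Answer A", "Answer B", "Answer C"])

def score(image_path: str, query: str, answer: str) -> float:
    """Hypothetical VL-GenRM call returning a scalar quality score for the answer."""
    return random.random()

def best_of_n(image_path: str, query: str, n: int = 8) -> str:
    """Sample n candidate answers and return the one the reward model rates highest."""
    candidates = [generate(image_path, query) for _ in range(n)]
    return max(candidates, key=lambda ans: score(image_path, query, ans))

if __name__ == "__main__":
    print(best_of_n("diagram.png", "What force acts on the block?", n=4))
```

In this setup, a reward model that judges candidates more accurately selects better answers more often, which is the intuition behind the reported correlation between benchmark accuracy and downstream Best-of-N performance.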