Multi-Crit: Benchmarking Multimodal Judges on Pluralistic Criteria-Following
Tianyi Xiong, Yi Ge, Ming Li, Zuolong Zhang, Pranav Kulkarni, Kaishen Wang, Qi He, Zeying Zhu, Chenxi Liu, Ruibo Chen, Tong Zheng, Yanshuo Chen, Xiyao Wang, Renrui Zhang, Wenhu Chen, Heng Huang
2025-11-28
Summary
This paper investigates how well large multimodal models (AI systems that can understand both text and images) can act as judges when evaluating other AI systems. These models are increasingly used in this role because they follow instructions well and tend to agree with human preferences.
What's the problem?
Currently, it's unclear how good these AI judges are at following *multiple*, very specific rules when evaluating something. Imagine needing to judge an image caption on both its accuracy and its creativity: can the AI consistently weigh both aspects? Existing benchmarks don't really test the ability to handle many different evaluation criteria at once, and it's important to know whether judges can reliably evaluate against each individual rule.
What's the solution?
The researchers created a new benchmark called Multi-Crit. This benchmark includes challenging pairs of AI-generated responses (like image captions or answers to questions) that have been carefully reviewed by humans who rated them based on several different criteria. They then tested 25 different large AI models on this benchmark, developing new ways to measure how well the models stick to all the rules, how easily they switch between considering different rules, and how they handle situations where the rules conflict. They also experimented with different training techniques to see if they could improve the models' judging abilities.
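The paper's exact metric definitions aren't given here, but the core idea of criterion-level judging can be sketched. Below is a minimal, hypothetical illustration (the function name, data layout, and the strict "all criteria correct" score are assumptions for this sketch, not the paper's actual implementation): given human preference labels per criterion and a judge's per-criterion verdicts on each response pair, compute accuracy per criterion plus a stricter adherence score that only counts pairs judged correctly on every criterion at once.

```python
from collections import defaultdict

def pluralistic_scores(examples):
    """Score a judge's per-criterion verdicts against human labels.

    `examples` is a list of dicts, each mapping a criterion name to
    (human_label, judge_label), where each label is "A" or "B"
    (which of the two responses is preferred under that criterion).
    Returns (per_criterion_accuracy, strict_adherence).
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    all_correct = 0
    for ex in examples:
        ok_everywhere = True
        for criterion, (human, judge) in ex.items():
            total[criterion] += 1
            if human == judge:
                correct[criterion] += 1
            else:
                ok_everywhere = False
        if ok_everywhere:
            all_correct += 1
    per_criterion = {c: correct[c] / total[c] for c in total}
    strict = all_correct / len(examples)
    return per_criterion, strict

# Toy example: two response pairs judged on accuracy and creativity.
examples = [
    {"accuracy": ("A", "A"), "creativity": ("B", "B")},  # all criteria correct
    {"accuracy": ("A", "A"), "creativity": ("A", "B")},  # one criterion missed
]
per_criterion, strict = pluralistic_scores(examples)
print(per_criterion)  # {'accuracy': 1.0, 'creativity': 0.5}
print(strict)         # 0.5
```

The gap between per-criterion accuracy and the strict score is what makes this kind of evaluation informative: a judge can look reasonable on each rule in isolation while rarely getting all of them right on the same example.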
Why it matters?
This work is important because it highlights the limitations of current AI judges. While they're getting better, they still struggle with complex evaluations that require considering multiple factors. The Multi-Crit benchmark provides a valuable tool for researchers to develop more reliable and controllable AI evaluation systems, ultimately helping us build better and more trustworthy AI.
Abstract
Large multimodal models (LMMs) are increasingly adopted as judges in multimodal evaluation systems due to their strong instruction following and consistency with human preferences. However, their ability to follow diverse, fine-grained evaluation criteria remains underexplored. We develop Multi-Crit, a benchmark for evaluating multimodal judges on their capacity to follow pluralistic criteria and produce reliable criterion-level judgments. Covering both open-ended generation and verifiable reasoning tasks, Multi-Crit is built through a rigorous data curation pipeline that gathers challenging response pairs with multi-criterion human annotations. It further introduces three novel metrics for systematically assessing pluralistic adherence, criterion-switching flexibility, and the ability to recognize criterion-level preference conflicts. Comprehensive analysis of 25 LMMs reveals that 1) proprietary models still struggle to maintain consistent adherence to pluralistic criteria, especially in open-ended evaluation; 2) open-source models lag further behind in flexibly following diverse criteria; and 3) critic fine-tuning with holistic judgment signals enhances visual grounding but fails to generalize to pluralistic criterion-level judgment. Additional analyses on reasoning fine-tuning, test-time scaling, and boundary consistency between open-source and proprietary models further probe the limits of current multimodal judges. As a pioneering study, Multi-Crit lays the foundation for building reliable and steerable multimodal AI evaluation.