ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks
Yan Yang, Dongxu Li, Haoning Wu, Bei Chen, Liu Liu, Liyuan Pan, Junnan Li
2025-03-11
Summary
This paper introduces ProBench, a benchmark designed to test how well AI models handle complex, expert-level tasks that mix images and text, like solving science problems or writing creative stories, using real-world questions submitted by professionals.
What's the problem?
Current AI models struggle with tasks that demand deep domain knowledge, advanced reasoning, and a combined understanding of visuals and text, especially in specialized fields like medicine or coding.
What's the solution?
ProBench gathers 4,000 real expert questions spanning 10 fields (like science and art) and evaluates AI models by using a 'judge' AI to rate their answers. The results reveal where models fail, such as missing details in images or lacking the reasoning skills the task requires.
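To make the 'judge' AI idea concrete, here is a minimal sketch of an MLLM-as-a-Judge scoring loop. The prompt wording, the 1-10 rubric, and the `Rating: [[x]]` reply format are illustrative assumptions, not ProBench's actual protocol; the judge model is stubbed out so the sketch stays self-contained.

```python
import re

# Hypothetical judge prompt; the wording and 1-10 rubric are assumptions,
# not ProBench's actual evaluation template.
JUDGE_PROMPT = (
    "You are an impartial judge. Rate the assistant's answer to the user's "
    "query on a 1-10 scale for correctness, reasoning, and completeness.\n"
    "Query: {query}\nAnswer: {answer}\n"
    "Reply with 'Rating: [[x]]' where x is an integer from 1 to 10."
)

def parse_rating(judge_reply):
    """Extract the integer rating from a reply like 'Rating: [[7]]'."""
    m = re.search(r"\[\[(\d+)\]\]", judge_reply)
    if m:
        score = int(m.group(1))
        if 1 <= score <= 10:
            return score
    return None  # malformed judge reply; caller decides how to handle it

def judge_answers(samples, call_judge):
    """Average the judge's ratings over (query, answer) pairs.

    `call_judge` stands in for a call to the judge MLLM; here it is any
    function mapping a prompt string to a reply string.
    """
    scores = []
    for query, answer in samples:
        reply = call_judge(JUDGE_PROMPT.format(query=query, answer=answer))
        rating = parse_rating(reply)
        if rating is not None:
            scores.append(rating)
    return sum(scores) / len(scores) if scores else 0.0

# Example with a stub judge that always replies 'Rating: [[8]]':
stub_judge = lambda prompt: "Rating: [[8]]"
avg = judge_answers([("What is 2+2?", "4")], stub_judge)
print(avg)  # 8.0
```

Parsing a fixed reply format (rather than trusting free-form judge text) is a common way to make such evaluations reproducible and machine-gradable.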
Why does it matter?
This helps improve AI tools for jobs requiring expert knowledge, like medical diagnosis or coding assistants, by identifying weaknesses and guiding better training methods for smarter, more reliable models.
Abstract
Solving expert-level multimodal tasks is a key milestone towards general intelligence. As the capabilities of multimodal large language models (MLLMs) continue to improve, evaluation of such advanced multimodal intelligence becomes necessary yet challenging. In this work, we introduce ProBench, a benchmark of open-ended user queries that require professional expertise and advanced reasoning. ProBench consists of 4,000 high-quality samples independently submitted by professionals based on their daily productivity demands. It spans across 10 fields and 56 sub-fields, including science, arts, humanities, coding, mathematics, and creative writing. Experimentally, we evaluate and compare 24 latest models using MLLM-as-a-Judge. Our results reveal that although the best open-source models rival the proprietary ones, ProBench presents significant challenges in visual perception, textual understanding, domain knowledge and advanced reasoning, thus providing valuable directions for future multimodal AI research efforts.