ProBench: Judging Multimodal Foundation Models on Open-ended Multi-domain Expert Tasks
Yan Yang, Dongxu Li, Haoning Wu, Bei Chen, Liu Liu, Liyuan Pan, Junnan Li
2025-03-11
Summary
This paper introduces ProBench, a benchmark designed to test how well AI models handle complex, expert-level tasks that mix images and text, like solving science problems or writing creative stories, using real-world questions submitted by professionals.
What's the problem?
Current AI models struggle with tasks that demand deep domain knowledge, advanced reasoning, and a combined understanding of visuals and text, especially in specialized fields like medicine or coding.
What's the solution?
ProBench gathers 4,000 real expert questions spanning 10 fields (like science and art) and evaluates AI models by using a 'judge' AI to rate their answers. The results reveal where models fail, such as missing details in images or lacking the reasoning skills the task requires.
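To make the 'judge' AI idea concrete, here is a minimal sketch of an MLLM-as-a-Judge scoring loop. The prompt wording, the 1-10 rubric, and the `Rating: [[x]]` reply format are illustrative assumptions, not ProBench's actual protocol; the judge model is stubbed out so the sketch stays self-contained.

```python
import re

# Hypothetical judge prompt; the wording and 1-10 rubric are assumptions,
# not ProBench's actual evaluation template.
JUDGE_PROMPT = (
    "You are an impartial judge. Rate the assistant's answer to the user's "
    "query on a 1-10 scale for correctness, reasoning, and completeness.\n"
    "Query: {query}\nAnswer: {answer}\n"
    "Reply with 'Rating: [[x]]' where x is an integer from 1 to 10."
)

def parse_rating(judge_reply):
    """Extract the integer rating from a reply like 'Rating: [[7]]'."""
    m = re.search(r"\[\[(\d+)\]\]", judge_reply)
    if m:
        score = int(m.group(1))
        if 1 <= score <= 10:
            return score
    return None  # malformed judge reply; caller decides how to handle it

def judge_answers(samples, call_judge):
    """Average the judge's ratings over (query, answer) pairs.

    `call_judge` stands in for a call to the judge MLLM; here it is any
    function mapping a prompt string to a reply string.
    """
    scores = []
    for query, answer in samples:
        reply = call_judge(JUDGE_PROMPT.format(query=query, answer=answer))
        rating = parse_rating(reply)
        if rating is not None:
            scores.append(rating)
    return sum(scores) / len(scores) if scores else 0.0

# Example with a stub judge that always replies 'Rating: [[8]]':
stub_judge = lambda prompt: "Rating: [[8]]"
avg = judge_answers([("What is 2+2?", "4")], stub_judge)
print(avg)  # 8.0
```

Parsing a fixed reply format (rather than trusting free-form judge text) is a common way to make such evaluations reproducible and machine-gradable.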
Why does it matter?
This helps improve AI tools for jobs requiring expert knowledge, like medical diagnosis or coding assistants, by identifying weaknesses and guiding better training methods for smarter, more reliable models.
Abstract
Solving expert-level multimodal tasks is a key milestone towards general intelligence. As the capabilities of multimodal large language models (MLLMs) continue to improve, evaluation of such advanced multimodal intelligence becomes necessary yet challenging. In this work, we introduce ProBench, a benchmark of open-ended user queries that require professional expertise and advanced reasoning. ProBench consists of 4,000 high-quality samples independently submitted by professionals based on their daily productivity demands. It spans across 10 fields and 56 sub-fields, including science, arts, humanities, coding, mathematics, and creative writing. Experimentally, we evaluate and compare 24 latest models using MLLM-as-a-Judge. Our results reveal that although the best open-source models rival the proprietary ones, ProBench presents significant challenges in visual perception, textual understanding, domain knowledge and advanced reasoning, thus providing valuable directions for future multimodal AI research efforts.