MMEvalPro: Calibrating Multimodal Benchmarks Towards Trustworthy and Efficient Evaluation
Jinsheng Huang, Liang Chen, Taian Guo, Fu Zeng, Yusheng Zhao, Bohan Wu, Ye Yuan, Haozhe Zhao, Zhihui Guo, Yichi Zhang, Jingyang Yuan, Wei Ju, Luchen Liu, Tianyu Liu, Baobao Chang, Ming Zhang
2024-07-02

Summary
This paper introduces MMEvalPro, a new benchmark designed to make the evaluation of large multimodal models (LMMs) fairer and more reliable. It focuses on improving the accuracy of assessments that involve both images and text.
What's the problem?
Current benchmarks for evaluating LMMs often have biases that can lead to misleading results. For example, some large language models (LLMs) that can't actually see images still perform well on tests that include visual elements. This raises questions about the credibility of these evaluations, as it suggests that the tests might not be accurately measuring the models' true abilities.
What's the solution?
To tackle this problem, the authors developed MMEvalPro, which uses a trilogy evaluation pipeline. For each question taken from existing benchmarks, human annotators add two related questions: one about perception (whether the model correctly reads the image) and one about knowledge (whether the model has the background knowledge the question requires), so that each original question becomes a triplet. Evaluating models on whole triplets makes it far easier to tell whether a model truly understands a problem or is simply guessing. MMEvalPro comprises 2,138 such triplets (6,414 questions in total), roughly two-thirds of which were written by human experts to ensure quality; a sketch of this triplet-level scoring appears below.
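Here is a minimal sketch of how such a triplet-level score could be computed, assuming the stricter metric credits a model only when the original question, the perception question, and the knowledge question in a triplet are all answered correctly. The `Triplet` dataclass and both function names are illustrative, not the authors' released evaluation code.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Triplet:
    """One MMEvalPro-style question triplet (field names are illustrative)."""
    origin_correct: bool      # model answered the original question correctly
    perception_correct: bool  # model answered the perception question correctly
    knowledge_correct: bool   # model answered the knowledge question correctly


def plain_accuracy(triplets: List[Triplet]) -> float:
    """Standard MCQ accuracy, counting only the original questions."""
    return sum(t.origin_correct for t in triplets) / len(triplets)


def triplet_accuracy(triplets: List[Triplet]) -> float:
    """Stricter triplet-level accuracy: credit a problem only when all
    three questions in its triplet are answered correctly, so lucky
    guesses on the original question no longer count."""
    hits = sum(
        t.origin_correct and t.perception_correct and t.knowledge_correct
        for t in triplets
    )
    return hits / len(triplets)


if __name__ == "__main__":
    # Toy example: one fully correct triplet, and one where only the
    # original question was answered correctly (likely a guess).
    results = [
        Triplet(True, True, True),
        Triplet(True, False, True),
    ]
    print(plain_accuracy(results))    # 1.0
    print(triplet_accuracy(results))  # 0.5
```

Under this kind of scoring, a text-only model that guesses the original answer but cannot read the image gets no credit, which is what makes the triplet design harder to game.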
Why it matters?
This research is important because it provides a more trustworthy way to evaluate LMMs, which is crucial for advancing AI technology. By ensuring that evaluations are fair and reflect actual understanding, MMEvalPro can help researchers develop better models and improve their performance on real-world tasks. This matters for applications in fields like healthcare, education, and autonomous systems, where accurate decision-making is critical.
Abstract
Large Multimodal Models (LMMs) exhibit impressive cross-modal understanding and reasoning abilities, often assessed through multiple-choice questions (MCQs) that include an image, a question, and several options. However, many benchmarks used for such evaluations suffer from systematic biases. Remarkably, Large Language Models (LLMs) without any visual perception capabilities achieve non-trivial performance, undermining the credibility of these evaluations. To address this issue while maintaining the efficiency of MCQ evaluations, we propose MMEvalPro, a benchmark designed to avoid Type-I errors through a trilogy evaluation pipeline and more rigorous metrics. For each original question from existing benchmarks, human annotators augment it by creating one perception question and one knowledge anchor question through a meticulous annotation process. MMEvalPro comprises 2,138 question triplets, totaling 6,414 distinct questions. Two-thirds of these questions are manually labeled by human experts, while the rest are sourced from existing benchmarks (MMMU, ScienceQA, and MathVista). Compared with the existing benchmarks, our experiments with the latest LLMs and LMMs demonstrate that MMEvalPro is more challenging (the best LMM lags behind human performance by 31.73%, compared to an average gap of 8.03% in previous benchmarks) and more trustworthy (the best LLM trails the best LMM by 23.09%, whereas the gap for previous benchmarks is just 14.64%). Our in-depth analysis explains the reason for the large performance gap and justifies the trustworthiness of evaluation, underscoring its significant potential for advancing future research.