
MME-Survey: A Comprehensive Survey on Evaluation of Multimodal LLMs

Chaoyou Fu, Yi-Fan Zhang, Shukang Yin, Bo Li, Xinyu Fang, Sirui Zhao, Haodong Duan, Xing Sun, Ziwei Liu, Liang Wang, Caifeng Shan, Ran He

2024-11-27

Summary

This paper provides a detailed overview of how to evaluate Multimodal Large Language Models (MLLMs), which take both text and images as input and generate responses about them.

What's the problem?

As MLLMs become more popular in AI research and applications, there is a growing need for effective evaluation methods. Current evaluation techniques often focus on single tasks and do not adequately assess the diverse capabilities of these models, making it hard to determine how well they perform in real-world scenarios.

What's the solution?

The authors present a comprehensive survey that covers four main aspects of evaluating MLLMs: the types of benchmarks, organized by the capabilities they test; the typical process of constructing a benchmark, from data collection to annotation; a systematic evaluation approach built around judges, metrics, and toolkits; and future directions for benchmarking. This structured view helps researchers understand how to assess MLLMs effectively and encourages the development of better evaluation methods; a minimal sketch of what such an evaluation looks like in practice follows below.
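To make the "judges and metrics" step concrete, here is a minimal, hypothetical sketch of how a multiple-choice MLLM benchmark might be scored. The item fields, the model_answer callable, and the option-letter extraction heuristic are illustrative assumptions, not the survey's actual toolkit.

# A minimal, hypothetical sketch of scoring a multiple-choice MLLM benchmark.
# `model_answer` stands in for whatever inference call the model exposes;
# the item fields and the answer-extraction heuristic are illustrative only.
from __future__ import annotations

import re
from dataclasses import dataclass

@dataclass
class BenchmarkItem:
    image_path: str          # image shown to the model
    question: str            # question about the image
    options: dict[str, str]  # e.g. {"A": "a cat", "B": "a dog"}
    answer: str              # gold option letter, e.g. "A"

def extract_choice(response: str, options: dict[str, str]) -> str | None:
    """Crude heuristic: return the first standalone option letter in the reply."""
    pattern = r"\b(" + "|".join(re.escape(k) for k in options) + r")\b"
    match = re.search(pattern, response.upper())
    return match.group(1) if match else None

def evaluate(items: list[BenchmarkItem], model_answer) -> float:
    """Accuracy of `model_answer(image_path, prompt) -> str` over the benchmark."""
    correct = 0
    for item in items:
        prompt = item.question + "\nOptions: " + " ".join(
            f"({k}) {v}" for k, v in item.options.items()
        )
        prediction = extract_choice(model_answer(item.image_path, prompt), item.options)
        correct += int(prediction == item.answer)
    return correct / len(items) if items else 0.0

Open-ended questions usually cannot be scored by exact matching like this, which is where judge models come in, as discussed in the abstract below.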

Why it matters?

This research is important because it helps improve the way we evaluate advanced AI models that can handle multiple types of data. By providing clear guidelines and benchmarks, the authors aim to drive progress in MLLM research, ensuring that these models can be effectively tested and improved for various applications in fields like healthcare, education, and entertainment.

Abstract

As a prominent direction of Artificial General Intelligence (AGI), Multimodal Large Language Models (MLLMs) have garnered increased attention from both industry and academia. Building upon pre-trained LLMs, this family of models further develops impressive multimodal perception and reasoning capabilities, such as writing code given a flow chart or creating stories based on an image. In the development process, evaluation is critical since it provides intuitive feedback and guidance on improving models. Distinct from the traditional train-eval-test paradigm that only favors a single task like image classification, the versatility of MLLMs has spurred the rise of various new benchmarks and evaluation methods. In this paper, we aim to present a comprehensive survey of MLLM evaluation, discussing four key aspects: 1) the summarized benchmark types, divided by the capabilities they evaluate, including foundation capabilities, model self-analysis, and extended applications; 2) the typical process of benchmark construction, consisting of data collection, annotation, and precautions; 3) the systematic evaluation manner composed of judge, metric, and toolkit; 4) the outlook for the next benchmark. This work aims to offer researchers an easy grasp of how to effectively evaluate MLLMs according to different needs and to inspire better evaluation methods, thereby driving the progress of MLLM research.
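As an illustration of the "judge" component mentioned in the abstract, the following is a hypothetical sketch of LLM-as-judge scoring for open-ended answers. The prompt template, the 1-5 scale, and the call_judge_llm placeholder are assumptions made for this example, not the paper's protocol.

# A hypothetical sketch of the "judge" step for open-ended answers: a strong LLM
# grades each response against a reference on a 1-5 scale. `call_judge_llm` is a
# placeholder for whatever judge model or API is used.
JUDGE_PROMPT = (
    "You are grading a multimodal model's answer.\n"
    "Question: {question}\n"
    "Reference answer: {reference}\n"
    "Model answer: {response}\n"
    "Rate the model answer from 1 (wrong) to 5 (fully correct). "
    "Reply with the number only."
)

def judge_score(question: str, reference: str, response: str, call_judge_llm) -> int:
    """Ask the judge model for a 1-5 score; fall back to 1 if the reply has no digit."""
    reply = call_judge_llm(
        JUDGE_PROMPT.format(question=question, reference=reference, response=response)
    )
    digits = [c for c in reply if c.isdigit()]
    score = int(digits[0]) if digits else 1
    return min(max(score, 1), 5)  # clamp to the valid range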