LMMs-Eval: Reality Check on the Evaluation of Large Multimodal Models

Kaichen Zhang, Bo Li, Peiyuan Zhang, Fanyi Pu, Joshua Adrian Cahyono, Kairui Hu, Shuai Liu, Yuanhan Zhang, Jingkang Yang, Chunyuan Li, Ziwei Liu

2024-07-18

Summary

This paper introduces LMMs-Eval, a unified benchmarking framework designed to evaluate large multimodal models (LMMs) transparently and reproducibly across a wide range of tasks.

What's the problem?

As large multimodal models become more advanced, there is a growing need for reliable and standardized methods to evaluate their performance. Current evaluation methods are often inconsistent and can lead to misleading results because they vary widely in how they collect and analyze data. This lack of standardization makes it difficult to compare different models and understand their strengths and weaknesses.

What's the solution?

LMMs-Eval provides a unified framework covering more than 50 evaluation tasks and more than 10 models, so different multimodal models can be assessed transparently and reproducibly under the same conditions. It also introduces LMMs-Eval LITE, a pruned version that keeps a representative subset of each task's data, making evaluations cheaper and faster while preserving coverage of the essential tasks. Additionally, the framework includes Multimodal LiveBench, which draws on continuously updated news and online forum content to test how well models generalize to information they could not have seen during training.
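To make this concrete, below is a minimal sketch of how an evaluation run might be launched with the open-sourced lmms-eval toolkit from Python. The entry point, flag names, and the model and task identifiers (llava, mme, mmbench_en) are assumptions based on the project's public README, not a definitive usage guide; check them against the repository linked in the abstract.

# Hypothetical sketch: launching a unified multimodal evaluation run with
# the lmms-eval toolkit. The entry point and flag names are assumed from
# the project's README; verify them against the repository before use.
import subprocess

command = [
    "python", "-m", "lmms_eval",
    "--model", "llava",              # which LMM backend to evaluate (placeholder)
    "--tasks", "mme,mmbench_en",     # comma-separated benchmark tasks (placeholders)
    "--batch_size", "1",
    "--output_path", "./logs/",      # directory where results are written
]

# Run the evaluation and fail loudly if the command errors out.
subprocess.run(command, check=True)

Because every model and task goes through the same harness and the same logging path, results produced this way are directly comparable across models, which is the transparency and reproducibility the framework is built for.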

Why it matters?

This research is important because it helps improve the way we evaluate AI models, ensuring that comparisons are fair and consistent. By providing a comprehensive benchmarking system, LMMs-Eval can guide future developments in AI technology, helping researchers and developers create better, more reliable multimodal models that can perform effectively in real-world applications.

Abstract

The advances of large foundation models necessitate wide-coverage, low-cost, and zero-contamination benchmarks. Despite continuous exploration of language model evaluations, comprehensive studies on the evaluation of Large Multi-modal Models (LMMs) remain limited. In this work, we introduce LMMS-EVAL, a unified and standardized multimodal benchmark framework with over 50 tasks and more than 10 models to promote transparent and reproducible evaluations. Although LMMS-EVAL offers comprehensive coverage, we find it still falls short in achieving low cost and zero contamination. To approach this evaluation trilemma, we further introduce LMMS-EVAL LITE, a pruned evaluation toolkit that emphasizes both coverage and efficiency. Additionally, we present Multimodal LIVEBENCH that utilizes continuously updating news and online forums to assess models' generalization abilities in the wild, featuring a low-cost and zero-contamination evaluation approach. In summary, our work highlights the importance of considering the evaluation trilemma and provides practical solutions to navigate the trade-offs in evaluating large multi-modal models, paving the way for more effective and reliable benchmarking of LMMs. We open-source our codebase and maintain a leaderboard of LIVEBENCH at https://github.com/EvolvingLMMs-Lab/lmms-eval and https://huggingface.co/spaces/lmms-lab/LiveBench.