MME-RealWorld: Could Your Multimodal LLM Challenge High-Resolution Real-World Scenarios that are Difficult for Humans?
Yi-Fan Zhang, Huanyu Zhang, Haochen Tian, Chaoyou Fu, Shuangqing Zhang, Junfei Wu, Feng Li, Kun Wang, Qingsong Wen, Zhang Zhang, Liang Wang, Rong Jin, Tieniu Tan
2024-08-26

Summary
This paper introduces MME-RealWorld, a new benchmark designed to test how well Multimodal Large Language Models (MLLMs) can handle complex real-world scenarios that are challenging even for humans.
What's the problem?
Many existing benchmarks for MLLMs do not accurately reflect the difficulties these models face in real-life situations. Problems include having too few examples (which leads to high performance variance), relying on model-generated annotations that limit data quality, and not being challenging enough, especially because of low-resolution images. This makes it hard to measure how well these models truly perform.
What's the solution?
To address these issues, the authors collected over 300,000 images from public datasets and the Internet and carefully selected 13,366 high-quality images for annotation. With the help of 25 professional annotators and 7 MLLM experts, they created 29,429 question-answer pairs covering 43 subtasks across five real-world scenarios. The result is the largest fully human-annotated benchmark of its kind, with a focus on high-resolution images and realistic applications. They then evaluated 28 leading MLLMs to see how well they perform on these challenging tasks.
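To make the evaluation setup concrete, here is a minimal sketch of how accuracy could be computed for a multiple-choice benchmark of this kind. It is not the authors' released evaluation code: the record fields (`subtask`, `answer`, `prediction`), the option-letter format, and the example subtask names are assumptions for illustration only.

```python
# Minimal sketch: scoring multiple-choice predictions per subtask and overall.
# Record layout and field names are hypothetical, not the official data schema.
from collections import defaultdict

def score(records):
    """Compute overall and per-subtask accuracy from model predictions."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for r in records:
        total[r["subtask"]] += 1
        # Count a hit when the predicted option letter matches the ground truth.
        if r["prediction"].strip().upper().startswith(r["answer"].strip().upper()):
            correct[r["subtask"]] += 1
    per_task = {t: correct[t] / total[t] for t in total}
    overall = sum(correct.values()) / max(sum(total.values()), 1)
    return overall, per_task

# Example usage with two hypothetical records.
records = [
    {"subtask": "OCR in the wild", "answer": "B", "prediction": "B"},
    {"subtask": "Remote sensing", "answer": "D", "prediction": "A"},
]
overall, per_task = score(records)
print(f"overall accuracy: {overall:.2%}")
for task, acc in per_task.items():
    print(f"{task}: {acc:.2%}")
```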
Why it matters?
This research is important because it helps improve our understanding of how well MLLMs can operate in complex environments. By identifying the challenges that even advanced models struggle with, researchers can work on developing better systems that can handle real-world tasks more effectively.
Abstract
Comprehensive evaluation of Multimodal Large Language Models (MLLMs) has recently garnered widespread attention in the research community. However, we observe that existing benchmarks present several common barriers that make it difficult to measure the significant challenges that models face in the real world, including: 1) small data scale, which leads to large performance variance; 2) reliance on model-based annotations, resulting in restricted data quality; 3) insufficient task difficulty, especially due to limited image resolution. To tackle these issues, we introduce MME-RealWorld. Specifically, we collect more than 300K images from public datasets and the Internet, filtering 13,366 high-quality images for annotation. This involves the efforts of 25 professional annotators and 7 experts in MLLMs, contributing to 29,429 question-answer pairs that cover 43 subtasks across 5 real-world scenarios, which are extremely challenging even for humans. As far as we know, MME-RealWorld is the largest manually annotated benchmark to date, featuring the highest resolution and a targeted focus on real-world applications. We further conduct a thorough evaluation involving 28 prominent MLLMs, such as GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet. Our results show that even the most advanced models struggle with our benchmarks, with none of them reaching 60% accuracy. The challenges of perceiving high-resolution images and understanding complex real-world scenarios remain urgent issues to be addressed. The data and evaluation code are released at https://mme-realworld.github.io/.