MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, Mark Steedman
2025-05-19
Summary
This paper introduces MMLongBench, a new benchmark built to test how well vision-language models, the AI systems that handle both images and text, cope with very long and complex inputs, such as lengthy documents or large collections of images at once.
What's the problem?
The problem is that most existing tests evaluate these models on a single, narrow task at a time. A good score on one task says little about whether a model can handle more demanding situations that require understanding large amounts of information spread across a long document or many images.
What's the solution?
The researchers created MMLongBench, a thorough benchmark that challenges models with a diverse set of tasks, different image types, and a range of input lengths. Measuring performance across all of these dimensions at once makes it clear where models are strong and where they break down as inputs grow longer and more complex; a sketch of what such an evaluation could look like is shown below.
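To make the idea concrete, here is a minimal, hypothetical sketch of a multi-task, multi-length evaluation loop. Everything in it (the Example record, run_model, exact_match, and the specific length buckets) is an illustrative assumption, not MMLongBench's actual API or data format.

```python
"""Illustrative sketch only: scoring a model per (task, input-length) cell."""
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Example:
    task: str            # e.g. "long-document QA", "many-image retrieval"
    context_tokens: int  # combined length of the image+text input
    prompt: str
    answer: str


def run_model(prompt: str) -> str:
    """Placeholder for a real long-context vision-language model call."""
    return ""


def exact_match(prediction: str, answer: str) -> float:
    """Toy metric; a real benchmark would use task-appropriate metrics."""
    return float(prediction.strip().lower() == answer.strip().lower())


def length_bucket(n_tokens: int) -> str:
    """Assign an example to an assumed input-length bucket."""
    for cap in (8_000, 16_000, 32_000, 64_000, 128_000):
        if n_tokens <= cap:
            return f"<={cap:,} tokens"
    return ">128,000 tokens"


def evaluate(examples: list[Example]) -> dict[tuple[str, str], float]:
    # Keep a separate score for each (task, length-bucket) cell so that
    # weakness on one task or at one input length is not hidden by a
    # single overall average.
    scores: defaultdict[tuple[str, str], list[float]] = defaultdict(list)
    for ex in examples:
        prediction = run_model(ex.prompt)
        cell = (ex.task, length_bucket(ex.context_tokens))
        scores[cell].append(exact_match(prediction, ex.answer))
    return {cell: mean(vals) for cell, vals in scores.items()}
```

Reporting a score per (task, length) cell rather than one overall number is the whole point: a model that looks strong on average can still collapse on a particular task or at a particular input length.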
Why it matters?
This matters because AI is increasingly asked to read long documents, analyze videos, and understand complex visual scenes. Before trusting these models in real-world settings, we need to know whether they can actually keep up as inputs get long, and a benchmark like MMLongBench gives a reliable way to check.
Abstract
MMLongBench is a benchmark that evaluates long-context vision-language models across various tasks, image types, and input lengths, revealing that single-task performance is insufficient for gauging overall capability.