MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly
Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, Mark Steedman
2025-05-19
Summary
This paper introduces MMLongBench, a new benchmark built to test how well vision-language models, the AI systems that handle both images and text, cope with very long and complex inputs, such as lengthy documents or large collections of images at once.
What's the problem?
The problem is that most existing tests evaluate these models on a single, narrow task at a time. A good score on one task says little about whether a model can handle more demanding situations that require understanding large amounts of information spread across a long document or many images.
What's the solution?
The researchers created MMLongBench, a thorough benchmark that challenges models with a diverse set of tasks, different image types, and a range of input lengths. Measuring performance across all of these dimensions at once makes it clear where models are strong and where they break down as inputs grow longer and more complex; a sketch of what such an evaluation could look like is shown below.
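To make the idea concrete, here is a minimal, hypothetical sketch of a multi-task, multi-length evaluation loop. Everything in it (the Example record, run_model, exact_match, and the specific length buckets) is an illustrative assumption, not MMLongBench's actual API or data format.

```python
"""Illustrative sketch only: scoring a model per (task, input-length) cell."""
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean


@dataclass
class Example:
    task: str            # e.g. "long-document QA", "many-image retrieval"
    context_tokens: int  # combined length of the image+text input
    prompt: str
    answer: str


def run_model(prompt: str) -> str:
    """Placeholder for a real long-context vision-language model call."""
    return ""


def exact_match(prediction: str, answer: str) -> float:
    """Toy metric; a real benchmark would use task-appropriate metrics."""
    return float(prediction.strip().lower() == answer.strip().lower())


def length_bucket(n_tokens: int) -> str:
    """Assign an example to an assumed input-length bucket."""
    for cap in (8_000, 16_000, 32_000, 64_000, 128_000):
        if n_tokens <= cap:
            return f"<={cap:,} tokens"
    return ">128,000 tokens"


def evaluate(examples: list[Example]) -> dict[tuple[str, str], float]:
    # Keep a separate score for each (task, length-bucket) cell so that
    # weakness on one task or at one input length is not hidden by a
    # single overall average.
    scores: defaultdict[tuple[str, str], list[float]] = defaultdict(list)
    for ex in examples:
        prediction = run_model(ex.prompt)
        cell = (ex.task, length_bucket(ex.context_tokens))
        scores[cell].append(exact_match(prediction, ex.answer))
    return {cell: mean(vals) for cell, vals in scores.items()}
```

Reporting a score per (task, length) cell rather than one overall number is the whole point: a model that looks strong on average can still collapse on a particular task or at a particular input length.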
Why it matters?
This matters because AI is increasingly asked to read long documents, analyze videos, and understand complex visual scenes. Before trusting these models in real-world settings, we need to know whether they can actually keep up as inputs get long, and a benchmark like MMLongBench gives a reliable way to check.
Abstract
MMLongBench is a benchmark that evaluates long-context vision-language models across various tasks, image types, and input lengths, revealing that single-task performance is insufficient for gauging overall capability.