Multimodal Needle in a Haystack: Benchmarking Long-Context Capability of Multimodal Large Language Models

Hengyi Wang, Haizhou Shi, Shiwei Tan, Weiyi Qin, Wenyuan Wang, Tunyu Zhang, Akshay Nambi, Tanuja Ganu, Hao Wang

2024-06-20

Summary

This paper presents the MultiModal Needle-in-a-Haystack (MMNeedle) benchmark, which is designed to test how well multimodal large language models (MLLMs) can handle long contexts. It focuses on evaluating a model's ability to locate a specific sub-image within a large collection of images based on text instructions.

What's the problem?

While MLLMs have shown great potential in various tasks, there hasn't been a thorough way to evaluate how they perform with long and complex inputs, especially when they must reason over many images at once. Existing benchmarks rarely capture the challenges of analyzing that much visual information in a single input, which is crucial for real-world applications.

What's the solution?

To address this issue, the authors developed the MMNeedle benchmark. It tests MLLMs by asking them to find a specific sub-image (the 'needle') within a larger collection of images (the 'haystack') based on a text description of its contents. To make the task harder, they use image stitching, combining many sub-images into larger grid images, which further lengthens the input context. The benchmark also includes a protocol that automatically generates sub-image-level labels, so it can measure exactly how accurately a model retrieves the correct sub-image. A rough sketch of how such a sample might be built is shown below.
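The following is a minimal, illustrative sketch of how a haystack sample could be assembled, assuming sub-images are stitched into an N x N grid and the needle's grid position is recorded as an automatic label. The function name, grid size, and cell size here are assumptions for illustration, not the benchmark's actual code.

```python
# Illustrative sketch (not the authors' implementation): stitch sub-images into
# a grid, place the needle at a random cell, and record its (row, column) label.
import random
from PIL import Image

def stitch_haystack(sub_images, needle, grid_size=4, cell_size=256):
    """Return a stitched grid image and the needle's (row, column) label.

    Assumes `sub_images` contains at least grid_size * grid_size - 1 images.
    """
    canvas = Image.new("RGB", (grid_size * cell_size, grid_size * cell_size))
    needle_row = random.randrange(grid_size)
    needle_col = random.randrange(grid_size)
    idx = 0
    for row in range(grid_size):
        for col in range(grid_size):
            if (row, col) == (needle_row, needle_col):
                tile = needle
            else:
                tile = sub_images[idx]
                idx += 1
            # Resize each tile to a fixed cell size and paste it into the grid.
            tile = tile.resize((cell_size, cell_size))
            canvas.paste(tile, (col * cell_size, row * cell_size))
    return canvas, (needle_row, needle_col)
```

A model is then given the stitched images plus a textual description of the needle and asked to report where (or whether) it appears.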

Why it matters?

This research is important because it provides a new way to assess the capabilities of MLLMs in handling complex visual information. By focusing on long-context scenarios, MMNeedle helps identify strengths and weaknesses in these models, paving the way for improvements in AI systems that need to understand and interact with multiple types of data simultaneously. This could lead to better applications in fields like robotics, virtual reality, and more advanced AI-driven tools.

Abstract

Multimodal Large Language Models (MLLMs) have shown significant promise in various applications, leading to broad interest from researchers and practitioners alike. However, a comprehensive evaluation of their long-context capabilities remains underexplored. To address this gap, we introduce the MultiModal Needle-in-a-Haystack (MMNeedle) benchmark, specifically designed to assess the long-context capabilities of MLLMs. Besides multi-image input, we employ image stitching to further increase the input context length, and develop a protocol to automatically generate labels for sub-image level retrieval. Essentially, MMNeedle evaluates MLLMs by stress-testing their capability to locate a target sub-image (needle) within a set of images (haystack) based on textual instructions and descriptions of image contents. This setup necessitates an advanced understanding of extensive visual contexts and effective information retrieval within long-context image inputs. With this benchmark, we evaluate state-of-the-art MLLMs, encompassing both API-based and open-source models. The findings reveal that GPT-4o consistently surpasses other models in long-context scenarios, but suffers from hallucination problems in negative samples, i.e., when needles are not in the haystacks. Our comprehensive long-context evaluation of MLLMs also sheds light on the considerable performance gap between API-based and open-source models. All the code, data, and instructions required to reproduce the main results are available at https://github.com/Wang-ML-Lab/multimodal-needle-in-a-haystack.
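As a rough illustration of the retrieval evaluation described above, the sketch below scores a prediction as correct only when it exactly matches the needle's stitched-image index, row, and column, and treats a negative sample (no needle in the haystack) as correct only when the model reports absence. The function name and label format are assumptions, not the benchmark's actual scoring code.

```python
# Illustrative scoring sketch (not the benchmark's evaluation code).
def score_prediction(prediction, label):
    """Return True if the prediction is correct.

    `label` is None for negative samples (needle absent); otherwise it is an
    (image_index, row, column) tuple. `prediction` uses the same format.
    """
    if label is None:
        # Negative sample: the model should report that the needle is absent.
        return prediction is None
    # Positive sample: require an exact match on image index, row, and column.
    return prediction == label

# Hypothetical usage:
print(score_prediction((2, 1, 3), (2, 1, 3)))  # True: exact-match hit
print(score_prediction((0, 0, 0), None))       # False: hallucinated needle
```

Counting hallucinated needles on negative samples separately is what exposes the hallucination behavior the abstract reports for GPT-4o.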