Visual Haystacks: Answering Harder Questions About Sets of Images
Tsung-Han Wu, Giscard Biamby, Jerome Quenum, Ritwik Gupta, Joseph E. Gonzalez, Trevor Darrell, David M. Chan
2024-07-23

Summary
This paper introduces Visual Haystacks (VHs), a new benchmark designed to evaluate how well large multimodal models (LMMs) can answer questions that span sets of images. It targets the ability of these models to process and reason over many images at once, mirroring real-world situations such as searching through photo albums or analyzing satellite imagery.
What's the problem?
While recent advancements have improved LMMs' performance in answering questions about single images, they struggle with tasks that require understanding and reasoning across multiple images. This limitation affects their usefulness in practical applications where users need to find specific information among large collections of unrelated images. Existing benchmarks do not adequately test these capabilities, leaving a gap in understanding how well these models can perform in complex scenarios.
What's the solution?
To address this issue, the authors created the Visual Haystacks benchmark, whose tasks require models to retrieve the relevant images from a large, mostly unrelated collection and answer questions based on them. The benchmark consists of roughly 1,000 question-answer pairs and challenges models to identify specific objects or relationships across different images. The authors also introduce MIRAGE (Multi-Image Retrieval Augmented Generation), a retrieval/QA framework that improves both the efficiency and the accuracy of multi-image question answering by compressing image encodings and filtering out images that are irrelevant to the query before generating an answer.
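To make the retrieve-then-answer idea concrete, the sketch below shows a minimal relevance-filtering step of the kind a MIRAGE-style pipeline relies on: score every image in the "haystack" against the question, keep only the top-ranked candidates, and hand those to a downstream multi-image QA model. This is not the authors' implementation; it stands in an off-the-shelf CLIP model (openai/clip-vit-base-patch32 via Hugging Face transformers) as the relevance filter, and the final QA call is hypothetical.

```python
# Illustrative retrieve-then-answer sketch for multi-image QA (MIQA).
# Not the MIRAGE implementation: CLIP is used here as a stand-in relevance filter.
from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve_relevant_images(question: str, image_paths: list[str], top_k: int = 3) -> list[str]:
    """Rank a haystack of images by similarity to the question and return the top-k paths."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    inputs = processor(text=[question], images=images, return_tensors="pt", padding=True)
    with torch.no_grad():
        outputs = model(**inputs)
    # logits_per_text has shape (1, num_images): question-to-image similarity scores.
    scores = outputs.logits_per_text[0]
    ranked = scores.argsort(descending=True)[:top_k]
    return [image_paths[int(i)] for i in ranked]

# Usage: filter the haystack first, then pass only the retained images to any
# multi-image-capable LMM together with the question.
# needles = retrieve_relevant_images("Is there a dog in any of these photos?", album_paths)
# answer = multimodal_qa_model(question, needles)  # hypothetical downstream call
```

The point of the filtering stage is that the answering model never sees most of the collection, which is what makes reasoning over thousands of images tractable; MIRAGE additionally compresses the visual features of the retained images rather than relying on a generic off-the-shelf retriever as this sketch does.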
Why it matters?
This research is significant because it pushes the boundaries of what LMMs can do with visual data. By providing a comprehensive benchmark like Visual Haystacks, researchers can better evaluate and improve AI systems for real-world applications, such as environmental monitoring, content retrieval, and more. The findings highlight critical areas for development in AI technology, ultimately leading to smarter systems capable of handling complex visual information.
Abstract
Recent advancements in Large Multimodal Models (LMMs) have made significant progress in the field of single-image visual question answering. However, these models face substantial challenges when tasked with queries that span extensive collections of images, similar to real-world scenarios like searching through large photo albums, finding specific information across the internet, or monitoring environmental changes through satellite imagery. This paper explores the task of Multi-Image Visual Question Answering (MIQA): given a large set of images and a natural language query, the task is to generate a relevant and grounded response. We propose a new public benchmark, dubbed "Visual Haystacks (VHs)," specifically designed to evaluate LMMs' capabilities in visual retrieval and reasoning over sets of unrelated images, where we perform comprehensive evaluations demonstrating that even robust closed-source models struggle significantly. Towards addressing these shortcomings, we introduce MIRAGE (Multi-Image Retrieval Augmented Generation), a novel retrieval/QA framework tailored for LMMs that confronts the challenges of MIQA with marked efficiency and accuracy improvements over baseline methods. Our evaluation shows that MIRAGE surpasses closed-source GPT-4o models by up to 11% on the VHs benchmark and offers up to 3.4x improvements in efficiency over text-focused multi-stage approaches.