CRAG-MM: Multi-modal Multi-turn Comprehensive RAG Benchmark

Jiaqi Wang, Xiao Yang, Kai Sun, Parth Suresh, Sanat Sharma, Adam Czyzewski, Derek Andersen, Surya Appini, Arkav Banerjee, Sajal Choudhary, Shervin Ghasemlou, Ziqiang Guan, Akil Iyer, Haidar Khan, Lingkun Kong, Roy Luo, Tiffany Ma, Zhen Qiao, David Tran, Wenfang Xu, Skyler Yeatman, Chen Zhou

2025-10-31

Summary

This paper introduces a new benchmark, called CRAG-MM, to evaluate how well AI systems can answer questions about what someone is *seeing* through wearable devices like smart glasses, using both images and multi-turn conversations.

What's the problem?

Currently, there isn't a thorough way to test systems designed to answer questions based on what a wearable camera sees. Existing benchmarks don't reflect the challenges of real-world wearable use, like imperfect image quality or the need to carry information forward from previous questions in a conversation. As a result, it's hard to know how much these systems actually understand and how much they just get lucky.

What's the solution?

The researchers created CRAG-MM, a collection of 6,500 image-question-answer triplets and 2,000 multi-turn conversations spanning 13 domains; most of the images (6,200) are egocentric, designed to mimic captures from a wearable device. The questions are built to reflect real-world difficulty, varying along several axes: image quality (five types of quality issues), question type (six types), how well-known the pictured entity is, whether the answer changes over time, and how many conversation turns are involved. The benchmark defines three tasks -- answering with a single retrieval source, answering with multiple sources, and handling multi-turn conversations -- each paired with a retrieval corpus and APIs for both image knowledge-graph lookup and web search. The researchers then tested existing systems on this benchmark.
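To make the dataset's shape concrete, here is a minimal sketch of how a CRAG-MM-style example might be represented. The field names, labels, and sample values below are illustrative assumptions for this summary, not the benchmark's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class Turn:
    question: str
    answer: str  # ground-truth reference answer


@dataclass
class Example:
    """Hypothetical record: field names are assumptions, not CRAG-MM's schema."""
    image_path: str     # egocentric capture, e.g. from smart glasses
    domain: str         # one of the benchmark's 13 domains
    image_quality: str  # e.g. "blur", "low-light", "normal" (illustrative labels)
    question_type: str  # one of the six question types (illustrative labels)
    turns: list[Turn] = field(default_factory=list)


# A single-turn example corresponds to one (image, question, answer) triplet.
single = Example(
    image_path="img/example_landmark.jpg",
    domain="landmarks",
    image_quality="normal",
    question_type="simple",
    turns=[Turn("What building is this?", "City Hall")],
)

# A multi-turn conversation carries several question-answer turns on one image,
# so a system must remember earlier turns to answer later ones.
multi = Example(
    image_path="img/example_plant.jpg",
    domain="plants",
    image_quality="blur",
    question_type="simple",
    turns=[
        Turn("What plant is this?", "A fiddle-leaf fig"),
        Turn("How often should I water it?", "Roughly once a week"),
    ],
)
```

The key structural point is that single-turn QA and multi-turn conversations share one representation; the conversation case simply has more than one turn attached to the same image.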

Why it matters?

This benchmark matters because it shows that current systems aren't very good at this task -- even the best ones struggle. By providing a challenging, realistic test, CRAG-MM will help researchers develop better AI systems for wearable devices, making them more helpful and reliable. The benchmark has already spurred competition: it hosted KDD Cup 2025, where winning solutions improved on the baseline by 28%, demonstrating its early value to the field.

Abstract

Wearable devices such as smart glasses are transforming the way people interact with their surroundings, enabling users to seek information regarding entities in their view. Multi-Modal Retrieval-Augmented Generation (MM-RAG) plays a key role in supporting such questions, yet there is still no comprehensive benchmark for this task, especially regarding wearables scenarios. To fill this gap, we present CRAG-MM -- a Comprehensive RAG benchmark for Multi-modal Multi-turn conversations. CRAG-MM contains a diverse set of 6.5K (image, question, answer) triplets and 2K visual-based multi-turn conversations across 13 domains, including 6.2K egocentric images designed to mimic captures from wearable devices. We carefully constructed the questions to reflect real-world scenarios and challenges, including five types of image-quality issues, six question types, varying entity popularity, differing information dynamism, and different conversation turns. We design three tasks: single-source augmentation, multi-source augmentation, and multi-turn conversations -- each paired with an associated retrieval corpus and APIs for both image-KG retrieval and webpage retrieval. Our evaluation shows that straightforward RAG approaches achieve only 32% and 43% truthfulness on CRAG-MM single- and multi-turn QA, respectively, whereas state-of-the-art industry solutions have similar quality (32%/45%), underscoring ample room for improvement. The benchmark has hosted KDD Cup 2025, attracting about 1K participants and 5K submissions, with winning solutions improving baseline performance by 28%, highlighting its early impact on advancing the field.
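The abstract reports results as "truthfulness" percentages. The CRAG-MM paper's exact grading rules are not spelled out here, but a common scheme in RAG benchmarking (used by the original CRAG benchmark) rewards correct answers, treats honest "I don't know" abstentions as neutral, and penalizes hallucinated answers. A minimal sketch under that assumption:

```python
def truthfulness(judgments: list[str]) -> float:
    """Assumed truthfulness-style score: +1 for a correct answer, 0 for an
    abstention ("missing"), -1 for a hallucinated (incorrect) answer,
    averaged over all answers. CRAG-MM's actual rules may differ."""
    points = {"correct": 1.0, "missing": 0.0, "hallucinated": -1.0}
    if not judgments:
        return 0.0
    return sum(points[j] for j in judgments) / len(judgments)


# A system that answers 4 of 10 questions correctly, abstains on 3,
# and hallucinates on 3 nets (4 - 3) / 10 = 0.1.
labels = ["correct"] * 4 + ["missing"] * 3 + ["hallucinated"] * 3
print(truthfulness(labels))  # 0.1
```

Under this kind of metric, abstaining honestly beats guessing wrong, which is why low truthfulness scores (32-45% here) signal both missed answers and hallucinations rather than simple inaccuracy.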