ViDoRe V3: A Comprehensive Evaluation of Retrieval Augmented Generation in Complex Real-World Scenarios
António Loison, Quentin Macé, Antoine Edy, Victor Xing, Tom Balough, Gabriel Moreira, Bo Liu, Manuel Faysse, Céline Hudelot, Gautier Viaud
2026-01-14
Summary
This paper introduces a new benchmark called ViDoRe v3 to better test how well AI systems can answer questions based on documents that include images, charts, and tables, not just text.
What's the problem?
Current methods for testing Retrieval-Augmented Generation (RAG) systems, which first retrieve relevant information and then generate an answer from it, fall short when evaluating how these systems handle complex documents with visual elements, or tasks that require combining information from multiple sources. Existing tests mostly focus on simple text-based questions and don't accurately reflect real-world scenarios where understanding visuals is crucial.
What's the solution?
The researchers created ViDoRe v3, a large and diverse collection of over 26,000 document pages paired with more than 3,000 questions, each available in six languages. The questions require understanding both the text and the visual elements within the documents, and the reference answers were carefully human-verified to ensure quality. The researchers then tested existing AI systems on this benchmark to see how they performed.
Why it matters?
This work is important because it provides a more realistic and challenging way to evaluate RAG systems. The results show that AI systems are getting better at using visual information, but still struggle with things like understanding complex charts, answering open-ended questions, and pinpointing exactly where in an image the answer can be found. By releasing this benchmark, the researchers hope to encourage further development of AI that can truly understand and utilize all the information in a document, not just the text.
Abstract
Retrieval-Augmented Generation (RAG) pipelines must address challenges beyond simple single-document retrieval, such as interpreting visual elements (tables, charts, images), synthesizing information across documents, and providing accurate source grounding. Existing benchmarks fail to capture this complexity, often focusing on textual data, single-document comprehension, or evaluating retrieval and generation in isolation. We introduce ViDoRe v3, a comprehensive multimodal RAG benchmark featuring multi-type queries over visually rich document corpora. It covers 10 datasets across diverse professional domains, comprising ~26,000 document pages paired with 3,099 human-verified queries, each available in 6 languages. Through 12,000 hours of human annotation effort, we provide high-quality annotations for retrieval relevance, bounding box localization, and verified reference answers. Our evaluation of state-of-the-art RAG pipelines reveals that visual retrievers outperform textual ones, late-interaction models and textual reranking substantially improve performance, and hybrid or purely visual contexts enhance answer generation quality. However, current models still struggle with non-textual elements, open-ended queries, and fine-grained visual grounding. To encourage progress in addressing these challenges, the benchmark is released under a commercially permissive license at https://hf.co/vidore.
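The retrieval-relevance annotations described in the abstract lend themselves to standard ranked-retrieval metrics such as recall@k. The sketch below shows how such a score might be computed against per-query relevance judgments; the data format here (query IDs mapped to ranked page IDs and to sets of relevant page IDs) is a hypothetical illustration, not the benchmark's actual schema.

```python
def recall_at_k(ranked_pages, relevant_pages, k=5):
    """Fraction of the annotated relevant pages found in the top-k results."""
    if not relevant_pages:
        return 0.0
    top_k = set(ranked_pages[:k])
    return len(top_k & set(relevant_pages)) / len(relevant_pages)

# Hypothetical retrieval run: query ID -> pages ranked by retriever score.
retrieved = {"q1": ["p3", "p7", "p1", "p9", "p2"]}
# Hypothetical gold annotations: query ID -> set of relevant page IDs.
gold = {"q1": {"p1", "p2", "p4"}}

scores = [recall_at_k(retrieved[q], gold[q], k=5) for q in gold]
mean_recall = sum(scores) / len(scores)  # averaged over all queries
```

Averaging the per-query scores, as in the last line, gives a single corpus-level number that makes retrievers directly comparable at a fixed cutoff k.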