ViDoRAG: Visual Document Retrieval-Augmented Generation via Dynamic Iterative Reasoning Agents
Qiuchen Wang, Ruixue Ding, Zehui Chen, Weiqi Wu, Shihang Wang, Pengjun Xie, Feng Zhao
2025-03-03

Summary
This paper introduces ViDoRAG, a retrieval-augmented generation (RAG) system designed to understand and answer questions over visually rich documents, such as those containing images, charts, and text. It also introduces ViDoSeek, a benchmark dataset for testing how well AI systems handle these types of documents.
What's the problem?
Traditional RAG systems struggle with complex documents that combine text and visuals. Purely visual retrieval methods fail to integrate textual and visual features effectively, and existing pipelines spend too few reasoning tokens to fully work through the information. As a result, they often retrieve the wrong content and generate inaccurate answers from such documents.
What's the solution?
The researchers created ViDoRAG, a multi-agent framework that tackles these challenges. It combines visual and text-based retrieval, using a Gaussian Mixture Model (GMM) to adaptively decide how many candidates to keep for each query. Its agents then work together iteratively through exploration, summarization, and reflection to refine the answer. Tested on the new ViDoSeek benchmark, ViDoRAG outperforms older methods by over 10%.
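The paper does not spell out its GMM-based hybrid retrieval here, but the core idea — fit a mixture to the retrieval similarity scores and keep only the candidates in the high-scoring component, so the cut-off adapts per query instead of using a fixed top-k — can be sketched as follows. The function names (`fit_gmm_2`, `adaptive_top_k`) and the plain-EM implementation are illustrative assumptions, not the paper's actual code.

```python
import math

def _gauss(x, m, v):
    """1-D Gaussian density."""
    return math.exp(-(x - m) ** 2 / (2 * v)) / math.sqrt(2 * math.pi * v)

def _variance(xs):
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def fit_gmm_2(scores, n_iter=100, eps=1e-6):
    """Fit a two-component 1-D Gaussian mixture with EM (deterministic init)."""
    n = len(scores)
    mu = [min(scores), max(scores)]              # low / high component means
    var = [max(_variance(scores), eps)] * 2      # shared initial variance
    w = [0.5, 0.5]
    resp = [[0.5, 0.5] for _ in scores]
    for _ in range(n_iter):
        # E-step: posterior responsibility of each component for each score
        for i, x in enumerate(scores):
            p = [w[k] * _gauss(x, mu[k], var[k]) for k in (0, 1)]
            total = (p[0] + p[1]) or eps
            resp[i] = [p[0] / total, p[1] / total]
        # M-step: re-estimate weights, means, variances
        for k in (0, 1):
            nk = sum(r[k] for r in resp) or eps
            w[k] = nk / n
            mu[k] = sum(r[k] * x for r, x in zip(resp, scores)) / nk
            var[k] = max(
                sum(r[k] * (x - mu[k]) ** 2 for r, x in zip(resp, scores)) / nk,
                eps,
            )
    return resp, mu

def adaptive_top_k(scores):
    """Keep candidates whose score falls in the high-mean mixture component."""
    resp, mu = fit_gmm_2(scores)
    high = 1 if mu[1] >= mu[0] else 0
    return [i for i, r in enumerate(resp) if r[high] > 0.5]
```

On a score list like `[0.9, 0.85, 0.88, 0.2, 0.15, 0.1, 0.18]` this keeps the first three candidates: the mixture separates a "relevant" cluster from a "background" cluster, so easy queries pass few pages to generation and harder ones pass more.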
Why it matters?
This matters because it improves how AI handles complex documents, making it more reliable for tasks like analyzing legal files, medical reports, or business data. By combining visuals and text more effectively and reasoning through the information step by step, ViDoRAG sets a new standard for document understanding in AI.
Abstract
Understanding information from visually rich documents remains a significant challenge for traditional Retrieval-Augmented Generation (RAG) methods. Existing benchmarks predominantly focus on image-based question answering (QA), overlooking the fundamental challenges of efficient retrieval, comprehension, and reasoning within dense visual documents. To bridge this gap, we introduce ViDoSeek, a novel dataset designed to evaluate RAG performance on visually rich documents requiring complex reasoning. Based on it, we identify key limitations in current RAG approaches: (i) purely visual retrieval methods struggle to effectively integrate both textual and visual features, and (ii) previous approaches often allocate insufficient reasoning tokens, limiting their effectiveness. To address these challenges, we propose ViDoRAG, a novel multi-agent RAG framework tailored for complex reasoning across visual documents. ViDoRAG employs a Gaussian Mixture Model (GMM)-based hybrid strategy to effectively handle multi-modal retrieval. To further elicit the model's reasoning capabilities, we introduce an iterative agent workflow incorporating exploration, summarization, and reflection, providing a framework for investigating test-time scaling in RAG domains. Extensive experiments on ViDoSeek validate the effectiveness and generalization of our approach. Notably, ViDoRAG outperforms existing methods by over 10% on the competitive ViDoSeek benchmark.
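The abstract's iterative agent workflow (exploration, summarization, reflection) implies a simple control loop: gather evidence, draft an answer, critique it, and repeat until the critique passes. The sketch below shows only that plumbing, with the actual agents passed in as callables; the function names, signatures, and stopping rule are my assumptions, not the paper's implementation.

```python
def vidorag_loop(query, retrieve, explore, summarize, reflect, max_rounds=3):
    """Illustrative explore -> summarize -> reflect loop over retrieved documents."""
    notes = []
    for _ in range(max_rounds):
        # Exploration: gather evidence from retrieved documents
        docs = retrieve(query, notes)
        notes.append(explore(query, docs))
        # Summarization: draft an answer from the accumulated notes
        draft = summarize(query, notes)
        # Reflection: accept the draft or feed criticism into the next round
        accepted, feedback = reflect(query, draft)
        if accepted:
            return draft
        notes.append(feedback)
    return draft  # best effort after the round budget is spent
```

Spending more rounds (and thus more reasoning tokens) before committing to an answer is one way to read the paper's framing of test-time scaling in RAG.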