M3DocRAG: Multi-modal Retrieval is What You Need for Multi-page Multi-document Understanding

Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, Mohit Bansal

2024-11-08

Summary

This paper presents M3DocRAG, a new framework for answering questions about documents that span many pages and carry information in multiple formats, such as text, charts, and images.

What's the problem?

Existing methods for document visual question answering (DocVQA) usually focus on single-page documents or rely on text extraction tools that often miss important visual information. This makes it difficult to answer questions that require information from different pages or documents, especially when that information is in images or figures.

What's the solution?

M3DocRAG introduces a multi-modal retrieval-augmented generation (RAG) approach that can handle various types of documents and questions. A multi-modal retriever (ColPali) first finds the most relevant pages, treating each page as an image, and a multi-modal language model (Qwen2-VL 7B) then generates an answer from those retrieved pages. Because retrieval narrows the context to a handful of pages, the framework works in both closed-domain settings (questions about a specific document) and open-domain settings (questions over a large collection of documents) while staying efficient. The researchers also created M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with more than 40,000 pages.
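To make the retrieve-then-answer flow concrete, here is a minimal Python sketch of the two-stage structure. The helpers `embed_page_image`, `embed_question`, and `answer_with_mlm` are hypothetical stand-ins for the paper's actual components (ColPali for page retrieval, Qwen2-VL 7B for answer generation), and the single-vector dot-product scoring is a simplification of ColPali's late-interaction scoring.

```python
import numpy as np

# Hypothetical stand-ins for the real components described in the paper
# (ColPali as the multi-modal retriever, Qwen2-VL 7B as the answering MLM).
def embed_page_image(page_image) -> np.ndarray:
    """Return a visual embedding for one PDF page rendered as an image (placeholder stub)."""
    raise NotImplementedError

def embed_question(question: str) -> np.ndarray:
    """Return an embedding for the question text (placeholder stub)."""
    raise NotImplementedError

def answer_with_mlm(question: str, page_images: list) -> str:
    """Prompt a multi-modal LM with the question plus retrieved page images (placeholder stub)."""
    raise NotImplementedError

def m3docrag_answer(question: str, corpus_pages: list, top_k: int = 4) -> str:
    """Retrieve the top-k most relevant pages by embedding similarity,
    then generate an answer from only those page images."""
    page_embs = np.stack([embed_page_image(p) for p in corpus_pages])  # (N, d)
    q_emb = embed_question(question)                                   # (d,)
    scores = page_embs @ q_emb        # dot-product relevance score per page
    top_idx = np.argsort(-scores)[:top_k]
    retrieved = [corpus_pages[i] for i in top_idx]
    return answer_with_mlm(question, retrieved)
```

Keeping only the top few retrieved pages is what lets a single multi-modal model answer questions over far more content than fits in its context window, whether the corpus is one long document (closed-domain) or thousands of PDFs (open-domain).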

Why it matters?

This research is significant because it improves how AI systems can understand and answer complex questions about documents. By effectively integrating information from multiple sources, M3DocRAG could enhance applications in fields like healthcare, law, and education, where accurate document analysis is crucial for making informed decisions.

Abstract

Document visual question answering (DocVQA) pipelines that answer questions from documents have broad applications. Existing methods focus on handling single-page documents with multi-modal language models (MLMs), or rely on text-based retrieval-augmented generation (RAG) that uses text extraction tools such as optical character recognition (OCR). However, there are difficulties in applying these methods in real-world scenarios: (a) questions often require information across different pages or documents, where MLMs cannot handle many long documents; (b) documents often have important information in visual elements such as figures, but text extraction tools ignore them. We introduce M3DocRAG, a novel multi-modal RAG framework that flexibly accommodates various document contexts (closed-domain and open-domain), question hops (single-hop and multi-hop), and evidence modalities (text, chart, figure, etc.). M3DocRAG finds relevant documents and answers questions using a multi-modal retriever and an MLM, so that it can efficiently handle single or many documents while preserving visual information. Since previous DocVQA datasets ask questions in the context of a specific document, we also present M3DocVQA, a new benchmark for evaluating open-domain DocVQA over 3,000+ PDF documents with 40,000+ pages. In three benchmarks (M3DocVQA/MMLongBench-Doc/MP-DocVQA), empirical results show that M3DocRAG with ColPali and Qwen2-VL 7B achieves superior performance compared to many strong baselines, including state-of-the-art performance in MP-DocVQA. We provide comprehensive analyses of different indexing, MLMs, and retrieval models. Lastly, we qualitatively show that M3DocRAG can successfully handle various scenarios, such as when relevant information exists across multiple pages and when answer evidence only exists in images.
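On the indexing analyses mentioned in the abstract: exhaustively scoring every page for each query becomes costly at the scale of 40,000+ pages, so an approximate nearest-neighbor index is one common way to speed up open-domain retrieval. Below is a minimal sketch using FAISS; the choice of an inverted-file (IVF) index and the parameters shown are illustrative assumptions, not the paper's exact configuration.

```python
import faiss
import numpy as np

def build_page_index(page_embeddings: np.ndarray) -> faiss.Index:
    """Build an approximate (inverted-file) index over page embeddings.

    page_embeddings: float32 array of shape (num_pages, dim).
    """
    dim = page_embeddings.shape[1]
    nlist = 128                              # number of coarse clusters (illustrative)
    quantizer = faiss.IndexFlatIP(dim)       # inner-product similarity
    index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(page_embeddings)             # learn the coarse clusters
    index.add(page_embeddings)               # add all page vectors
    return index

def retrieve_pages(index: faiss.Index, query_emb: np.ndarray, top_k: int = 4):
    """Return (scores, page_ids) of the approximate top-k pages for one query."""
    scores, ids = index.search(query_emb.reshape(1, -1).astype(np.float32), top_k)
    return scores[0], ids[0]
```

Approximate indexes trade a small amount of retrieval accuracy for much faster search, which is what makes open-domain retrieval over thousands of PDFs practical.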