MMDocIR: Benchmarking Multi-Modal Retrieval for Long Documents
Kuicai Dong, Yujing Chang, Xin Deik Goh, Dexun Li, Ruiming Tang, Yong Liu
2025-01-16

Summary
This paper introduces MMDocIR, a new benchmark for testing how well retrieval systems can find different types of content in long documents, such as pictures, tables, and charts, not just text.
What's the problem?
Right now, there's no good way to test how well computer systems can find and understand all the different kinds of information in long documents. This makes it hard to know if these systems are actually working well or to improve them.
What's the solution?
The researchers created MMDocIR, a benchmark with two main tasks: page-level retrieval, which checks whether a system can find the right pages in a long document, and layout-level retrieval, which checks whether it can pinpoint specific elements on a page, such as a particular chart, table, or paragraph. They built the benchmark from 1,685 expert-annotated questions and 173,843 automatically generated ones. Using MMDocIR, they evaluated different kinds of retrieval systems and found, among other things, that systems that look at page images do better than ones that only read extracted text.
Why does it matter?
This matters because as we use computers more and more to help us find information in big documents, we need to make sure they're doing a good job with all types of content, not just text. MMDocIR gives researchers a way to test and improve these systems, which could lead to better search tools for things like research papers, textbooks, or even medical records. It shows that looking at pictures and layout, not just words, is really important for understanding documents, which could change how we design search systems in the future.
Abstract
Multi-modal document retrieval is designed to identify and retrieve various forms of multi-modal content, such as figures, tables, charts, and layout information from extensive documents. Despite its significance, there is a notable lack of a robust benchmark to effectively evaluate the performance of systems in multi-modal document retrieval. To address this gap, this work introduces a new benchmark, named MMDocIR, encompassing two distinct tasks: page-level and layout-level retrieval. The former focuses on localizing the most relevant pages within a long document, while the latter targets the detection of specific layouts, offering a finer granularity than whole-page analysis. A layout can refer to a variety of elements such as textual paragraphs, equations, figures, tables, or charts. The MMDocIR benchmark comprises a rich dataset featuring expertly annotated labels for 1,685 questions and bootstrapped labels for 173,843 questions, making it a pivotal resource for advancing multi-modal document retrieval in both training and evaluation. Through rigorous experiments, we reveal that (i) visual retrievers significantly outperform their text counterparts, (ii) the MMDocIR train set can effectively benefit the training process of multi-modal document retrieval, and (iii) text retrievers leveraging VLM-text perform much better than those using OCR-text. These findings underscore the potential advantages of integrating visual elements for multi-modal document retrieval.
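To make the page-level retrieval task concrete, here is a minimal illustrative sketch (not the authors' code or the official MMDocIR evaluation protocol) of how one might rank a document's pages against a query and score the ranking with top-k recall over cosine similarities. All names, embedding dimensions, and data below are hypothetical placeholders.

```python
# Illustrative sketch of page-level retrieval evaluation (assumed setup,
# not the MMDocIR implementation). Embeddings are assumed to be precomputed
# by some query encoder and page encoder (text-based or visual).
import numpy as np

def recall_at_k(query_embs, page_embs, relevant_pages, k=5):
    """Fraction of queries whose top-k retrieved pages include at least one
    ground-truth relevant page, ranked by cosine similarity."""
    # Normalize so that a dot product equals cosine similarity.
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    p = page_embs / np.linalg.norm(page_embs, axis=1, keepdims=True)
    scores = q @ p.T                          # shape: (num_queries, num_pages)
    topk = np.argsort(-scores, axis=1)[:, :k]  # indices of top-k pages per query
    hits = [
        len(set(topk[i]) & set(relevant_pages[i])) > 0
        for i in range(len(relevant_pages))
    ]
    return float(np.mean(hits))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    queries = rng.normal(size=(3, 8))   # 3 toy queries, 8-dim embeddings
    pages = rng.normal(size=(20, 8))    # 20 toy document pages
    gold = [[2], [5, 6], [19]]          # relevant page indices per query
    print("Recall@5:", recall_at_k(queries, pages, gold, k=5))
```

Layout-level retrieval can be framed the same way, with page embeddings replaced by embeddings of individual layout elements (paragraphs, tables, figures), which is what makes it a finer-grained task than whole-page ranking.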