Unifying Multimodal Retrieval via Document Screenshot Embedding

Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, Jimmy Lin

2024-06-18

Summary

This paper introduces Document Screenshot Embedding (DSE), a method for retrieving documents by encoding screenshots of them directly, rather than relying on traditional text-extraction pipelines. Because a screenshot preserves a document's text, images, and layout exactly as rendered, DSE simplifies retrieval while keeping all of the document's visual information intact.

What's the problem?

Retrieving information from documents can be complicated because documents come in many formats, like PDFs, web pages, and presentations. Traditional methods require a lot of preparation, such as breaking down the document into text and images, which can lead to errors and loss of important information. This makes it hard to find what you're looking for quickly and accurately.

What's the solution?

To solve this problem, the authors developed DSE, which treats a document screenshot as a single, unified input format. Instead of extracting text and images separately, DSE uses a large vision-language model to encode the entire screenshot into one dense vector that captures all the information in the document. They tested DSE on Wiki-SS, a corpus of 1.3 million Wikipedia page screenshots paired with questions from the Natural Questions dataset, and found it competitive with text-based retrievers that depend on parsing, beating BM25 by 17 points in top-1 retrieval accuracy. On a mixed-modality slide retrieval task, it outperformed OCR-based text retrieval by over 15 points in nDCG@10.
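To make the encode-and-search loop concrete, here is a minimal sketch of screenshot-based dense retrieval. It uses OpenAI's CLIP (via Hugging Face transformers) purely as a stand-in bi-encoder; the paper's actual model is a larger vision-language model, and the checkpoint and file paths below are illustrative assumptions, not the authors' released artifacts.

```python
# Minimal sketch of screenshot-based dense retrieval in the spirit of DSE.
# CLIP is a stand-in encoder; the paper uses a large vision-language model.
# File paths, the checkpoint, and the query are illustrative assumptions.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

# Index: encode each document screenshot into one dense vector.
screenshot_paths = ["page_001.png", "page_002.png"]  # hypothetical files
with torch.no_grad():
    images = [Image.open(p).convert("RGB") for p in screenshot_paths]
    pixels = processor(images=images, return_tensors="pt")
    doc_embs = model.get_image_features(**pixels)
    doc_embs = doc_embs / doc_embs.norm(dim=-1, keepdim=True)  # unit-normalize

# Search: encode the text query and rank screenshots by cosine similarity.
with torch.no_grad():
    tokens = processor(text=["when was the eiffel tower built"],
                       return_tensors="pt", padding=True)
    q_emb = model.get_text_features(**tokens)
    q_emb = q_emb / q_emb.norm(dim=-1, keepdim=True)

scores = (q_emb @ doc_embs.T).squeeze(0)           # cosine similarities
ranked = scores.argsort(descending=True).tolist()  # best screenshot first
print([screenshot_paths[i] for i in ranked])
```

The key design choice this illustrates is that the document side is just pixels: no parser, OCR step, or layout heuristic sits between the original file and the index.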

Why it matters?

This research is important because it offers a more efficient way to retrieve information from various types of documents without losing details. By using screenshots, DSE can help improve search engines and information retrieval systems, making it easier for users to find accurate information quickly. This advancement could benefit many fields, including education, research, and business, where accessing relevant documents is crucial.

Abstract

In the real world, documents are organized in different formats and varied modalities. Traditional retrieval pipelines require tailored document parsing techniques and content extraction modules to prepare input for indexing. This process is tedious, prone to errors, and incurs information loss. To this end, we propose Document Screenshot Embedding (DSE), a novel retrieval paradigm that regards document screenshots as a unified input format, which does not require any content-extraction preprocessing and preserves all the information in a document (e.g., text, image, and layout). DSE leverages a large vision-language model to directly encode document screenshots into dense representations for retrieval. To evaluate our method, we first craft the Wiki-SS dataset, a corpus of 1.3M Wikipedia web page screenshots, to answer questions from the Natural Questions dataset. In such a text-intensive document retrieval setting, DSE shows competitive effectiveness compared to other text retrieval methods relying on parsing. For example, DSE outperforms BM25 by 17 points in top-1 retrieval accuracy. Additionally, in a mixed-modality task of slide retrieval, DSE significantly outperforms OCR text retrieval methods by over 15 points in nDCG@10. These experiments show that DSE is an effective document retrieval paradigm for diverse types of documents. Model checkpoints, code, and the Wiki-SS collection will be released.
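For readers unfamiliar with the two metrics the abstract cites, here is a small sketch of how they are computed. This uses one common nDCG formulation (linear gain, log2 discount); the paper's exact evaluation details may differ, and the toy relevance values are made up for illustration.

```python
# Sketch of the two metrics cited in the abstract, under the assumptions above.
import math

def dcg_at_k(relevances, k=10):
    # Discounted cumulative gain over the top-k results, in ranked order.
    return sum(rel / math.log2(i + 2) for i, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k=10):
    # Normalize by the DCG of the ideal (relevance-sorted) ranking.
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

def top1_accuracy(hits):
    # Fraction of queries whose top-ranked document is relevant.
    return sum(hits) / len(hits)

print(ndcg_at_k([3, 2, 0, 1], k=10))       # ~0.985 on these toy grades
print(top1_accuracy([True, False, True]))  # ~0.667 on these toy outcomes
```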