VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, Maosong Sun

2024-10-15

Summary

This paper introduces VisRAG, a new system that improves how large language models (LLMs) generate content by using both the text and the visual information in documents.

What's the problem?

Current retrieval-augmented generation (RAG) systems rely only on text, so they cannot take advantage of important visual elements such as images and layout found in multi-modal documents. This limits their ability to fully understand these documents and to generate content based on them.

What's the solution?

VisRAG addresses this by using a vision-language model (VLM) to embed each document directly as an image instead of extracting its text first. This preserves the information in the original document, including layout and figures, which improves both the retrieval of relevant pages and the generation of responses. The authors train the retriever on a mix of open-source and synthetic data, explore several generation methods, and show a 25–39% end-to-end performance gain over traditional text-based RAG pipelines.
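To make the pipeline concrete, here is a minimal sketch of a vision-based RAG loop under stated assumptions: the helpers embed_query, embed_page_image, and vlm_generate are hypothetical placeholders rather than the actual VisRAG API, and the sketch only assumes a shared embedding space for queries and page images plus a VLM that accepts images together with a text prompt.

```python
# Minimal sketch of a vision-based RAG pipeline (illustrative; the helper
# names embed_query, embed_page_image, and vlm_generate are assumptions,
# not the authors' API).
import numpy as np

def retrieve(query_emb: np.ndarray, page_embs: np.ndarray, k: int = 3) -> list[int]:
    """Rank document page images by cosine similarity to the query embedding."""
    q = query_emb / np.linalg.norm(query_emb)
    p = page_embs / np.linalg.norm(page_embs, axis=1, keepdims=True)
    scores = p @ q  # cosine similarity per page
    return np.argsort(-scores)[:k].tolist()

def vis_rag_answer(question, page_images, embed_query, embed_page_image, vlm_generate):
    # 1) Embed each document page directly as an image; no OCR or text parsing.
    page_embs = np.stack([embed_page_image(img) for img in page_images])
    # 2) Embed the question and retrieve the top-k most similar pages.
    top_ids = retrieve(embed_query(question), page_embs)
    # 3) Pass the retrieved page images and the question to a VLM to generate the answer.
    return vlm_generate(images=[page_images[i] for i in top_ids], prompt=question)
```

The key design point is that the same model family handles both stages: a VLM-based encoder ranks page images, and a VLM generator consumes the retrieved images directly, so nothing is lost to text extraction.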

Why it matters?

This research is important because it enhances the capabilities of AI models in handling complex documents that contain both text and images. By allowing models to utilize visual information effectively, VisRAG can lead to more accurate and informative responses in applications like research, education, and content creation.

Abstract

Retrieval-augmented generation (RAG) is an effective technique that enables large language models (LLMs) to utilize external knowledge sources for generation. However, current RAG systems are solely based on text, rendering it impossible to utilize vision information like layout and images that play crucial roles in real-world multi-modality documents. In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the data information in the original documents, eliminating the information loss introduced during the parsing process. We collect both open-source and synthetic data to train the retriever in VisRAG and explore a variety of generation methods. Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 25–39% end-to-end performance gain over traditional text-based RAG pipelines. Further analysis reveals that VisRAG is effective in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents. Our code and data are available at https://github.com/openbmb/visrag.
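The abstract mentions exploring a variety of generation methods. One practical question is how to hand several retrieved pages to a generator that accepts only a single image; the sketch below shows one plausible approach, stitching the retrieved pages into one image before generation. This is an illustrative assumption about one such method, not a description of the exact VisRAG implementation.

```python
# Hedged sketch: one way to feed several retrieved page images to a
# single-image VLM is to stack them into one tall image before generation.
# This concatenation strategy is an illustration, not the exact VisRAG code.
from PIL import Image

def concatenate_pages(pages: list[Image.Image]) -> Image.Image:
    """Stack retrieved page images vertically into a single image."""
    width = max(p.width for p in pages)
    height = sum(p.height for p in pages)
    canvas = Image.new("RGB", (width, height), "white")
    y = 0
    for p in pages:
        canvas.paste(p, (0, y))
        y += p.height
    return canvas
```

A multi-image VLM could instead consume the retrieved pages directly, trading this simple preprocessing step for a model that must reason over several inputs at once.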