RAG-Anything: All-in-One RAG Framework

Zirui Guo, Xubin Ren, Lingrui Xu, Jiahao Zhang, Chao Huang

2025-10-15

Summary

This paper introduces RAG-Anything, a new system designed to improve how Large Language Models (LLMs) use information from various sources like text, images, tables, and equations. It's about making LLMs smarter by letting them access and understand a wider range of knowledge.

What's the problem?

Current systems that help LLMs find and use information, called Retrieval-Augmented Generation or RAG, mostly focus on text. However, real-world information isn't just text; it includes pictures, charts, and formulas. This limits how well LLMs can answer questions or solve problems when the information they need is spread across different types of data. Existing RAG systems struggle with these 'multimodal' documents, meaning documents with multiple types of content.

What's the solution?

The researchers created RAG-Anything, which treats all types of information – text, images, tables, equations – as interconnected pieces of knowledge rather than isolated data types. The system builds two linked graphs: one capturing how content from different modalities relates to each other (for example, a figure and the paragraph that discusses it), and another capturing the meaning of the text itself. Retrieval then combines both views, navigating the structure of the data while matching the meaning of the words, so the system can answer questions whose evidence spans multiple sources, such as an image and a paragraph of text. It's like having a system that can 'read' a chart and then explain it using text.
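To make the dual-graph idea concrete, here is a minimal sketch of how two such graphs might be built: one linking segments of different modalities that appear in the same document, and one linking text segments that share vocabulary. The class, segment tuples, and linking rules are illustrative assumptions for this summary, not the paper's actual data structures.

```python
# Illustrative sketch (not the paper's implementation): a "dual graph"
# with one cross-modal graph and one textual-semantics graph.
from dataclasses import dataclass, field

@dataclass
class DualGraph:
    cross_modal: dict = field(default_factory=dict)  # node -> set of related nodes
    textual: dict = field(default_factory=dict)

    def _link(self, graph, a, b):
        graph.setdefault(a, set()).add(b)
        graph.setdefault(b, set()).add(a)

    def add_document(self, segments):
        """segments: list of (segment_id, modality, terms) tuples."""
        # Cross-modal graph: connect segments of different modalities from
        # the same document, so an image node is reachable from nearby text.
        for i, (sid_a, mod_a, _) in enumerate(segments):
            for sid_b, mod_b, _ in segments[i + 1:]:
                if mod_a != mod_b:
                    self._link(self.cross_modal, sid_a, sid_b)
        # Textual graph: connect text segments that share vocabulary.
        text_segs = [(sid, set(terms)) for sid, mod, terms in segments if mod == "text"]
        for i, (sid_a, terms_a) in enumerate(text_segs):
            for sid_b, terms_b in text_segs[i + 1:]:
                if terms_a & terms_b:
                    self._link(self.textual, sid_a, sid_b)

g = DualGraph()
g.add_document([
    ("t1", "text", ["revenue", "growth"]),
    ("f1", "figure", []),
    ("t2", "text", ["growth", "forecast"]),
])
print(sorted(g.cross_modal["f1"]))  # → ['t1', 't2']: the figure links to both text segments
```

The point of keeping two graphs rather than one is that a query can follow structural edges (text-to-figure) even when the figure itself has no matching words.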

Why it matters?

RAG-Anything is important because it significantly improves LLMs' ability to work with complex, real-world information. It overcomes the limitations of existing systems that can only handle text, and it performs especially well with long documents where finding the right information is difficult. This advancement paves the way for LLMs that can truly understand and reason about the world around them, making them more useful for tasks like research, problem-solving, and education.

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a fundamental paradigm for expanding Large Language Models beyond their static training limitations. However, a critical misalignment exists between current RAG capabilities and real-world information environments. Modern knowledge repositories are inherently multimodal, containing rich combinations of textual content, visual elements, structured tables, and mathematical expressions. Yet existing RAG frameworks are limited to textual content, creating fundamental gaps when processing multimodal documents. We present RAG-Anything, a unified framework that enables comprehensive knowledge retrieval across all modalities. Our approach reconceptualizes multimodal content as interconnected knowledge entities rather than isolated data types. The framework introduces dual-graph construction to capture both cross-modal relationships and textual semantics within a unified representation. We develop cross-modal hybrid retrieval that combines structural knowledge navigation with semantic matching. This enables effective reasoning over heterogeneous content where relevant evidence spans multiple modalities. RAG-Anything demonstrates superior performance on challenging multimodal benchmarks, achieving significant improvements over state-of-the-art methods. Performance gains become particularly pronounced on long documents where traditional approaches fail. Our framework establishes a new paradigm for multimodal knowledge access, eliminating the architectural fragmentation that constrains current systems. Our framework is open-sourced at: https://github.com/HKUDS/RAG-Anything.
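The "cross-modal hybrid retrieval" described in the abstract combines structural knowledge navigation with semantic matching. A hedged sketch of one way to blend the two signals: score each candidate segment by a mix of query similarity and graph proximity to an anchor node. The Jaccard similarity, distance decay, and `alpha` weighting below are illustrative assumptions, not the paper's actual scoring function.

```python
# Illustrative hybrid scoring: semantic match blended with graph distance.
from collections import deque

def graph_distance(graph, start, goal):
    """Shortest hop count between two nodes; None if unreachable."""
    seen, queue = {start}, deque([(start, 0)])
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def hybrid_score(query_terms, seg_terms, graph, anchor, seg_id, alpha=0.6):
    # Semantic part: Jaccard overlap between query and segment terms.
    q, s = set(query_terms), set(seg_terms)
    semantic = len(q & s) / len(q | s) if q | s else 0.0
    # Structural part: decays with graph distance from an anchor node
    # (e.g. the best-matching text segment for the query).
    d = graph_distance(graph, anchor, seg_id)
    structural = 0.0 if d is None else 1.0 / (1 + d)
    return alpha * semantic + (1 - alpha) * structural

graph = {"t1": {"f1"}, "f1": {"t1", "t2"}, "t2": {"f1"}}
print(round(hybrid_score(["growth"], ["growth", "forecast"], graph, "t1", "t2"), 3))
```

In this toy example, segment `t2` earns credit both for matching the query term and for sitting two hops from the anchor through a figure node, which is the kind of cross-modal path a text-only retriever cannot exploit.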