
Ask in Any Modality: A Comprehensive Survey on Multimodal Retrieval-Augmented Generation

Mohammad Mahdi Abootorabi, Amirhosein Zobeiri, Mahdi Dehghani, Mohammadali Mohammadkhani, Bardia Mohammadi, Omid Ghahroodi, Mahdieh Soleymani Baghshah, Ehsaneddin Asgari

2025-02-18

Summary

This paper surveys Multimodal Retrieval-Augmented Generation (RAG), an approach that makes AI language models smarter by letting them draw on different types of information, such as text, images, audio, and video, when answering questions or creating content.

What's the problem?

Regular AI language models can sometimes make things up (hallucinate) or give outdated information because they only know what was in their original training data. They also struggle to understand and connect different types of information, such as matching a picture with its description.

What's the solution?

The researchers analyzed how Multimodal RAG systems work, breaking them down into their core stages: retrieval, fusion, augmentation, and generation. They examined the datasets used to train and benchmark these systems, the metrics for measuring whether they are doing a good job, and the techniques that help the AI understand and combine different types of information. They also explored how to make these systems more reliable and able to handle real-world situations.

Why it matters?

This matters because it helps create AI that can understand and use information more like humans do, by combining different types of data. This could lead to smarter digital assistants, better search engines, and AI that can help with complex tasks in fields like healthcare or education. By laying out what we know and what still needs work, this study guides future research to make AI systems that are more capable and trustworthy.

Abstract

Large Language Models (LLMs) struggle with hallucinations and outdated knowledge due to their reliance on static training data. Retrieval-Augmented Generation (RAG) mitigates these issues by integrating external dynamic information, enhancing factual and updated grounding. Recent advances in multimodal learning have led to the development of Multimodal RAG, incorporating multiple modalities such as text, images, audio, and video to enhance the generated outputs. However, cross-modal alignment and reasoning introduce unique challenges to Multimodal RAG, distinguishing it from traditional unimodal RAG. This survey offers a structured and comprehensive analysis of Multimodal RAG systems, covering datasets, metrics, benchmarks, evaluation, methodologies, and innovations in retrieval, fusion, augmentation, and generation. We precisely review training strategies, robustness enhancements, and loss functions, while also exploring the diverse Multimodal RAG scenarios. Furthermore, we discuss open challenges and future research directions to support advancements in this evolving field. This survey lays the foundation for developing more capable and reliable AI systems that effectively leverage multimodal dynamic external knowledge bases. Resources are available at https://github.com/llm-lab-org/Multimodal-RAG-Survey.
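To make the pipeline described in the abstract concrete, here is a minimal sketch of the retrieve-augment loop at the heart of a Multimodal RAG system: embed a query, rank items in a multimodal knowledge base by similarity, and fuse the top matches into the generator's prompt. The knowledge base, the hand-made embedding vectors, and all names are invented for illustration; a real system would use a shared cross-modal encoder (e.g. a CLIP-style model) to produce the embeddings, and is not taken from the survey itself.

```python
import math

# Toy multimodal knowledge base. Each entry pairs a modality tag with a
# hand-made embedding and the content the generator would receive.
# Real embeddings would come from a shared cross-modal encoder; these
# three-dimensional vectors are illustrative placeholders.
KNOWLEDGE_BASE = [
    {"modality": "text",  "embedding": [0.9, 0.1, 0.0],
     "content": "The Eiffel Tower is about 330 m tall."},
    {"modality": "image", "embedding": [0.8, 0.2, 0.1],
     "content": "<photo of the Eiffel Tower>"},
    {"modality": "audio", "embedding": [0.1, 0.9, 0.3],
     "content": "<clip of street noise>"},
]

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def retrieve(query_embedding, k=2):
    """Rank knowledge-base items by similarity to the query; keep top k."""
    ranked = sorted(
        KNOWLEDGE_BASE,
        key=lambda item: cosine(query_embedding, item["embedding"]),
        reverse=True,
    )
    return ranked[:k]

def augment_prompt(question, retrieved):
    """Fuse the retrieved multimodal evidence into the generation prompt."""
    context = "\n".join(
        f"[{item['modality']}] {item['content']}" for item in retrieved
    )
    return f"Context:\n{context}\n\nQuestion: {question}"

# Pretend this vector encodes "How tall is the Eiffel Tower?".
query_embedding = [0.85, 0.15, 0.05]
prompt = augment_prompt(
    "How tall is the Eiffel Tower?", retrieve(query_embedding)
)
print(prompt)
```

With this query, the text snippet and the tower photo score highest and end up in the prompt, while the unrelated audio clip is filtered out. The cross-modal alignment and fusion challenges the survey discusses arise precisely in the two steps this sketch trivializes: producing comparable embeddings across modalities and deciding how to combine the retrieved evidence.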