Towards Mixed-Modal Retrieval for Universal Retrieval-Augmented Generation

Chenghao Zhang, Guanting Dong, Xinyu Yang, Zhicheng Dou

2025-10-21

Summary

This paper introduces a new system called Nyx that improves how large language models generate text by letting them look up information in a variety of sources, including both text and images. The goal is to make these models better at answering questions that require both reading and seeing.

What's the problem?

Current systems that help language models by retrieving information mostly work with just text. However, in the real world, we often need to understand information presented in multiple ways – like a question about a picture or a document with images. Existing methods struggle when both the question and the information they need to find come in different formats like text and images combined, limiting their usefulness in many practical situations.

What's the solution?

The researchers created Nyx, a retriever designed to handle both text and images when searching for information. To train Nyx, they built a new dataset called NyxQA, which contains question-answer pairs mixing images and text, generated automatically from web documents using a four-stage pipeline that creates and then filters the examples. They trained Nyx in two stages: first, they pre-trained it on NyxQA along with other existing retrieval datasets, and then they fine-tuned it using feedback from downstream vision-language models to make sure the information it retrieves actually helps those models generate good answers.
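The core idea of a mixed-modal retriever can be pictured as embedding both the query and each document into a single vector space, regardless of whether they contain text, images, or both, and then ranking by similarity. The sketch below illustrates that idea with toy hand-made vectors and a naive averaging fusion; the function names, the fusion rule, and the embeddings are all illustrative assumptions, not Nyx's actual architecture.

```python
import math

def cosine(a, b):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def fuse(text_vec, image_vec):
    """Hypothetical fusion: average text and image embeddings into one
    mixed-modal vector; fall back to the text vector if no image exists."""
    if image_vec is None:
        return text_vec
    return [(t + i) / 2 for t, i in zip(text_vec, image_vec)]

def retrieve(query, corpus, k=1):
    """Rank documents by similarity to a mixed-modal query.

    `query` is a (text_vec, image_vec) pair; each corpus entry has an id
    plus text/image embeddings (image may be None for text-only docs).
    """
    q = fuse(*query)
    ranked = sorted(
        corpus,
        key=lambda doc: cosine(q, fuse(doc["text"], doc["image"])),
        reverse=True,
    )
    return [doc["id"] for doc in ranked[:k]]

# Toy corpus: one mixed-modal document, one text-only document.
corpus = [
    {"id": "doc_a", "text": [1.0, 0.0], "image": [0.9, 0.1]},
    {"id": "doc_b", "text": [0.0, 1.0], "image": None},
]
# A mixed-modal query (text embedding + image embedding).
query = ([0.8, 0.2], [1.0, 0.0])
print(retrieve(query, corpus))  # → ['doc_a']
```

In the real system, the embeddings would come from a trained vision-language encoder rather than hand-written lists, and the second training stage described above would adjust the retriever so that the top-ranked documents are the ones a generator actually finds useful.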

Why it matters?

This work is important because it makes language models more versatile and capable of handling real-world information, which is often presented in multiple formats. By improving their ability to understand and reason about both text and images, these models can provide more accurate and relevant responses to a wider range of questions, ultimately making them more useful in applications like visual question answering and image captioning.

Abstract

Retrieval-Augmented Generation (RAG) has emerged as a powerful paradigm for enhancing large language models (LLMs) by retrieving relevant documents from an external corpus. However, existing RAG systems primarily focus on unimodal text documents, and often fall short in real-world scenarios where both queries and documents may contain mixed modalities (such as text and images). In this paper, we address the challenge of Universal Retrieval-Augmented Generation (URAG), which involves retrieving and reasoning over mixed-modal information to improve vision-language generation. To this end, we propose Nyx, a unified mixed-modal to mixed-modal retriever tailored for URAG scenarios. To mitigate the scarcity of realistic mixed-modal data, we introduce a four-stage automated pipeline for generation and filtering, leveraging web documents to construct NyxQA, a dataset comprising diverse mixed-modal question-answer pairs that better reflect real-world information needs. Building on this high-quality dataset, we adopt a two-stage training framework for Nyx: we first perform pre-training on NyxQA along with a variety of open-source retrieval datasets, followed by supervised fine-tuning using feedback from downstream vision-language models (VLMs) to align retrieval outputs with generative preferences. Experimental results demonstrate that Nyx not only performs competitively on standard text-only RAG benchmarks, but also excels in the more general and realistic URAG setting, significantly improving generation quality in vision-language tasks.