MV-RAG: Retrieval Augmented Multiview Diffusion
Yosef Dayani, Omer Benishu, Sagie Benaim
2025-08-26
Summary
This paper introduces MV-RAG, a new method for generating 3D models from text descriptions, focusing on improving results for objects that are unusual or not commonly found in the training data.
What's the problem?
Current text-to-3D generation systems are really good at making common objects, but they struggle when asked to create something rare or outside of what they've 'seen' during training, often producing 3D models that are inconsistent across viewpoints or don't accurately reflect the text description. Essentially, they lack the ability to generalize to new concepts.
What's the solution?
MV-RAG tackles this by first searching a large database of real-world 2D images for pictures relevant to the text description. It then uses these retrieved images as a guide, conditioning a multiview diffusion model on them to build a consistent object seen from multiple viewpoints. The model is trained with a hybrid strategy: on structured multiview (3D) data, where the conditioning views are augmented to mimic the variability of real retrieval results, and on sets of retrieved real-world 2D images, where it must predict a held-out view from the remaining ones so that it learns 3D consistency from 2D data alone. The authors also created a new set of challenging prompts to test how well the system handles unusual requests.
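To make the retrieve-then-condition idea concrete, here is a minimal sketch of the inference flow, assuming a CLIP-style embedding index over the 2D image database. All names (embed_text, retrieve_images, generate_multiview) are illustrative placeholders, not the authors' actual code.

# Hypothetical sketch of an MV-RAG-style inference pipeline.
import numpy as np

def embed_text(prompt: str) -> np.ndarray:
    """Placeholder text encoder (a CLIP text tower or similar in practice)."""
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    v = rng.normal(size=512)
    return v / np.linalg.norm(v)

def retrieve_images(prompt: str, image_embeddings: np.ndarray, k: int = 4) -> np.ndarray:
    """Return indices of the k database images most similar to the prompt.

    image_embeddings: (N, 512) array of precomputed, L2-normalized embeddings
    for an in-the-wild 2D image collection.
    """
    q = embed_text(prompt)
    scores = image_embeddings @ q          # cosine similarity against the query
    return np.argsort(-scores)[:k]         # top-k most relevant images

def generate_multiview(prompt: str, retrieved_images, num_views: int = 4):
    """Stand-in for the retrieval-conditioned multiview diffusion model: it would
    jointly denoise `num_views` images, attending to both the text prompt and
    the retrieved reference images to keep the views consistent and accurate."""
    raise NotImplementedError("placeholder for the trained diffusion model")

# Usage (assuming `db_embeddings` and `database_images` are precomputed):
# idx = retrieve_images("a shoebill made of stained glass", db_embeddings)
# views = generate_multiview("a shoebill made of stained glass", database_images[idx])

The retrieval step is standard nearest-neighbor search over image embeddings; the novelty described in the paper lies in how the diffusion model is trained to exploit those retrieved, unstructured images.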
Why it matters?
This research is important because it makes 3D creation from text more reliable and versatile. By being able to handle rare or out-of-domain concepts, it opens up possibilities for creating a wider range of 3D content, which is useful for things like game development, design, and even virtual reality experiences, where you often need to generate objects that aren't standard.
Abstract
Text-to-3D generation approaches have advanced significantly by leveraging pretrained 2D diffusion priors, producing high-quality and 3D-consistent outputs. However, they often fail to produce out-of-domain (OOD) or rare concepts, yielding inconsistent or inaccurate results. To this end, we propose MV-RAG, a novel text-to-3D pipeline that first retrieves relevant 2D images from a large in-the-wild 2D database and then conditions a multiview diffusion model on these images to synthesize consistent and accurate multiview outputs. Training such a retrieval-conditioned model is achieved via a novel hybrid strategy bridging structured multiview data and diverse 2D image collections. This involves training on multiview data using augmented conditioning views that simulate retrieval variance for view-specific reconstruction, alongside training on sets of retrieved real-world 2D images using a distinctive held-out view prediction objective: the model predicts the held-out view from the other views to infer 3D consistency from 2D data. To facilitate a rigorous OOD evaluation, we introduce a new collection of challenging OOD prompts. Experiments against state-of-the-art text-to-3D, image-to-3D, and personalization baselines show that our approach significantly improves 3D consistency, photorealism, and text adherence for OOD/rare concepts, while maintaining competitive performance on standard benchmarks.
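For readers who want to see the shape of the held-out view prediction objective mentioned above, below is a minimal, hypothetical PyTorch sketch. The tensor shapes, the simplified noising step, and the model(...) signature are assumptions for illustration, not the paper's training code.

# Hypothetical sketch of a held-out view prediction loss on retrieved 2D images.
import torch

def held_out_view_loss(model, images: torch.Tensor, prompt_emb: torch.Tensor) -> torch.Tensor:
    """images: (K, C, H, W) retrieved real-world 2D views of the same concept.

    One view is hidden from the model; it must reconstruct (denoise) that view
    conditioned on the remaining K-1 views and the text embedding, which pushes
    the model toward 3D-consistent predictions using only unstructured 2D data.
    """
    k = images.shape[0]
    held = torch.randint(k, (1,)).item()                     # pick the held-out view
    target = images[held]
    context = torch.cat([images[:held], images[held + 1:]], dim=0)

    t = torch.rand(1)                                        # random diffusion timestep
    noise = torch.randn_like(target)
    noisy_target = (1 - t) * target + t * noise              # simplified noising (stand-in schedule)

    pred_noise = model(noisy_target, context, prompt_emb, t) # assumed model signature
    return torch.nn.functional.mse_loss(pred_noise, noise)

In this sketch the supervision signal is entirely 2D, yet predicting an unseen view from the others forces the model to reason about the underlying 3D object, which is the intuition behind the hybrid training strategy the abstract describes.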