VIST3A: Text-to-3D by Stitching a Multi-view Reconstruction Network to a Video Generator

Hyojun Go, Dominik Narnhofer, Goutam Bhat, Prune Truong, Federico Tombari, Konrad Schindler

2025-10-17

Summary

This paper introduces VIST3A, a method for creating 3D scenes from text descriptions alone. It combines the strengths of two kinds of pretrained AI models: latent text-to-video generators, which are good at producing visual content from text, and feedforward 3D reconstruction networks, which are good at recovering 3D geometry from images.

What's the problem?

Creating 3D scenes from text is hard because it requires connecting two different kinds of AI model. The first generates a latent representation, a compressed "idea" of what the scene should look like; the second turns that representation into actual 3D structure. The challenge is joining the two so that neither loses the knowledge already encoded in its weights, and ensuring that the generated latents can actually be decoded into consistent, realistic 3D geometry.

What's the solution?

The researchers tackled this by "stitching" the two models together. They identified the layer inside the 3D reconstruction network whose representation best matches the video generator's latents, and joined the models at that point so information flows smoothly; this step needs only a small dataset and no labels. Then they used direct reward finetuning, a technique borrowed from human preference alignment, to train the video generator to produce latents that the stitched 3D decoder can reliably turn into good-looking 3D scenes.
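The layer-selection idea behind stitching can be illustrated with a toy sketch: fit a simple affine map from the generator's latents to each candidate decoder layer's activations, and pick the layer with the lowest fitting error. This is a minimal illustration of the concept, not the paper's implementation; the shapes, layer names, and the use of a purely linear stitch are assumptions for demonstration.

```python
import numpy as np

def fit_stitch(latents, activations):
    """Fit an affine map (W, b) minimizing ||latents @ W + b - activations||^2.

    Returns the map and its mean squared residual, used as a
    compatibility score for this candidate stitching point.
    """
    # Append a bias column so lstsq solves for W and b jointly.
    X = np.hstack([latents, np.ones((latents.shape[0], 1))])
    coef, _, _, _ = np.linalg.lstsq(X, activations, rcond=None)
    W, b = coef[:-1], coef[-1]
    err = float(np.mean((latents @ W + b - activations) ** 2))
    return W, b, err

rng = np.random.default_rng(0)
latents = rng.normal(size=(256, 16))  # toy stand-in for video-generator latents

# Hypothetical candidate layers of a 3D decoder:
# layer_1 is (almost) a linear function of the latents, layer_2 is not.
candidates = {
    "layer_1": latents @ rng.normal(size=(16, 32))
               + 0.01 * rng.normal(size=(256, 32)),
    "layer_2": np.tanh(latents @ rng.normal(size=(16, 32)))
               + rng.normal(size=(256, 32)),
}

scores = {name: fit_stitch(latents, act)[2] for name, act in candidates.items()}
best = min(scores, key=scores.get)  # stitch where the latents fit best
```

In this toy setup the linear-fit residual is far smaller for `layer_1`, so it would be chosen as the stitching point; the real method applies the same "find the most compatible layer" logic to actual network activations.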

Why it matters?

This work matters because it markedly improves the quality of text-generated 3D scenes over prior text-to-3D models that output Gaussian splats, and, with a suitable 3D base model, also enables high-quality text-to-pointmap generation. It opens the door to creating detailed 3D environments simply by typing a description, with potential applications in game development, virtual reality, and design.

Abstract

The rapid progress of large, pretrained models for both visual content generation and 3D reconstruction opens up new possibilities for text-to-3D generation. Intuitively, one could obtain a formidable 3D scene generator if one were able to combine the power of a modern latent text-to-video model as "generator" with the geometric abilities of a recent (feedforward) 3D reconstruction system as "decoder". We introduce VIST3A, a general framework that does just that, addressing two main challenges. First, the two components must be joined in a way that preserves the rich knowledge encoded in their weights. We revisit model stitching, i.e., we identify the layer in the 3D decoder that best matches the latent representation produced by the text-to-video generator and stitch the two parts together. That operation requires only a small dataset and no labels. Second, the text-to-video generator must be aligned with the stitched 3D decoder, to ensure that the generated latents are decodable into consistent, perceptually convincing 3D scene geometry. To that end, we adapt direct reward finetuning, a popular technique for human preference alignment. We evaluate the proposed VIST3A approach with different video generators and 3D reconstruction models. All tested pairings markedly improve over prior text-to-3D models that output Gaussian splats. Moreover, by choosing a suitable 3D base model, VIST3A also enables high-quality text-to-pointmap generation.