Gen3R: 3D Scene Generation Meets Feed-Forward Reconstruction
Jiaxin Huang, Yuanbo Yang, Bangbang Yang, Lin Ma, Yuewen Ma, Yiyi Liao
2026-01-08
Summary
This paper introduces Gen3R, a new technique for creating realistic 3D scenes from images or videos by combining the strengths of existing 3D reconstruction models and video generation AI models.
What's the problem?
Creating detailed and accurate 3D models of scenes is difficult. Traditional 3D reconstruction methods can struggle with incomplete data or noisy images, leading to inaccurate results. While video generation models can create visually appealing content, they often lack the geometric consistency needed for true 3D scenes. Essentially, it's hard to get both realistic visuals *and* accurate 3D structure at the same time.
What's the solution?
Gen3R solves this by taking a pre-trained 3D reconstruction model (VGGT) and 'teaching' it to work with a powerful video generation model. A small 'adapter' translates the reconstruction model's internal representation (its tokens) into a format the video model understands. The two models then work together, generating both the visual appearance of the scene (colors and textures) and its 3D geometry (shapes, depth, and camera positions) simultaneously. Because these outputs are generated jointly, the visuals and the 3D structure stay aligned and consistent.
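The core idea of adapter training can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: a plain linear projection stands in for the adapter, random arrays stand in for the reconstruction model's tokens and the video model's latents, and all dimensions are invented. The only thing the sketch demonstrates is the training setup itself: both feature sets stay frozen while only the small adapter is optimized to align one representation with the other.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes (not from the paper): width of the reconstruction
# model's geometry tokens and of the video model's appearance latents.
D_GEO, D_APP, N_TOKENS = 64, 32, 16

# Stand-ins for frozen features: geometry tokens from the reconstruction
# backbone, and the appearance latents they are regularized to align with.
geo_tokens = rng.normal(size=(N_TOKENS, D_GEO))
app_latents = rng.normal(size=(N_TOKENS, D_APP))

# A single linear projection stands in for the learned adapter.
W = np.zeros((D_GEO, D_APP))

def adapter(tokens):
    # Map geometry tokens into the appearance latent space.
    return tokens @ W

def mse(pred, target):
    return float(np.mean((pred - target) ** 2))

initial_loss = mse(adapter(geo_tokens), app_latents)

# Plain gradient descent on the MSE alignment loss; only the adapter's
# weights W are updated, while both feature sets stay fixed.
lr = 0.1
for _ in range(200):
    residual = adapter(geo_tokens) - app_latents
    grad = 2.0 * geo_tokens.T @ residual / residual.size
    W -= lr * grad

final_loss = mse(adapter(geo_tokens), app_latents)
print(initial_loss, final_loss)
```

In the actual method the adapter is trained against a real alignment regularizer on the pre-trained video diffusion model's latents; the frozen-backbone / small-trainable-adapter pattern is the part this sketch captures.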
Why it matters?
This research is important because it allows for the creation of higher-quality 3D scenes from images and videos than previously possible. It also shows that combining different types of AI models – reconstruction and generation – can lead to better results than using them separately. This could have applications in areas like virtual reality, robotics, and creating realistic simulations.
Abstract
We present Gen3R, a method that bridges the strong priors of foundational reconstruction models and video diffusion models for scene-level 3D generation. We repurpose the VGGT reconstruction model to produce geometric latents by training an adapter on its tokens, which are regularized to align with the appearance latents of pre-trained video diffusion models. By jointly generating these disentangled yet aligned latents, Gen3R produces both RGB videos and corresponding 3D geometry, including camera poses, depth maps, and global point clouds. Experiments demonstrate that our approach achieves state-of-the-art results in single- and multi-image conditioned 3D scene generation. Additionally, our method can enhance the robustness of reconstruction by leveraging generative priors, demonstrating the mutual benefit of tightly coupling reconstruction and generative models.
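The phrase "jointly generating these disentangled yet aligned latents" can be caricatured as one sampling loop that updates both latent streams together, then splits them for decoding. Everything below is a toy stand-in: the "denoiser" is a simple pull toward a fixed target rather than a learned diffusion network, and all sizes are invented. The point is only the structure, a single iterative update over the concatenated appearance and geometry latents, which is what keeps the two outputs consistent with each other.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical per-frame latent widths and number of frames.
D_APP, D_GEO, T = 8, 8, 4

# Stand-in "clean" joint latents the sampler should recover; in the real
# system the update direction would come from a learned denoiser.
target = rng.normal(size=(T, D_APP + D_GEO))

def denoiser(z):
    # Toy score: pull the noisy latents toward the target. Appearance and
    # geometry halves are updated *jointly*, never in separate loops.
    return target - z

# Simple iterative sampling loop over the concatenated latents.
z = rng.normal(size=(T, D_APP + D_GEO))
initial_err = float(np.mean((z - target) ** 2))
for _ in range(50):
    z = z + 0.1 * denoiser(z) + 0.01 * rng.normal(size=z.shape)
final_err = float(np.mean((z - target) ** 2))

# After sampling, the aligned streams are split and decoded separately:
# appearance latents to RGB frames, geometry latents to depth/points/poses.
appearance_latents = z[:, :D_APP]
geometry_latents = z[:, D_APP:]
print(initial_err, final_err)
```

Decoding the two halves with separate heads while sampling them with one shared loop mirrors the paper's claim that the latents are disentangled (different decoders, different modalities) yet aligned (one joint generative process).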