ReconX: Reconstruct Any Scene from Sparse Views with Video Diffusion Model
Fangfu Liu, Wenqiang Sun, Hanyang Wang, Yikai Wang, Haowen Sun, Junliang Ye, Jun Zhang, Yueqi Duan
2024-08-30
Summary
This paper presents ReconX, a new method for reconstructing detailed 3D scenes from only a few images or video frames by leveraging the generative prior of a large pre-trained video diffusion model.
What's the problem?
Reconstructing a realistic 3D scene from only a few views is an ill-posed problem. With too few images to constrain the optimization, the resulting 3D models often have missing details or visible distortions, especially in areas the cameras never observed, making it hard to represent the scene accurately.
What's the solution?
ReconX addresses this issue by reframing sparse-view reconstruction as the task of generating a sequence of video frames. From the limited input views, it first builds a global point cloud (a rough 3D representation of the scene) and encodes it into a contextual space that serves as a 3D structure condition. Guided by this condition, a pre-trained video diffusion model synthesizes detailed frames that stay 3D-consistent, so the scene looks coherent from different angles. Finally, a confidence-aware 3D Gaussian Splatting optimization converts the generated frames back into a complete 3D model, trusting each frame only as much as its estimated reliability.
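The last step, confidence-aware 3D Gaussian Splatting, relies on down-weighting generated frames that are less 3D-consistent so they pull the reconstruction less. Below is a minimal NumPy sketch of one way such a confidence-weighted photometric loss could look; the function name, per-frame confidence weighting, and L1 error choice are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def confidence_weighted_l1(rendered, generated, confidence):
    """Per-frame L1 photometric loss, down-weighted by confidence.

    rendered:   (F, H, W, 3) images rendered from the current 3D model
    generated:  (F, H, W, 3) frames produced by the video diffusion model
    confidence: (F,) per-frame confidence in [0, 1]; low-confidence
                (less 3D-consistent) frames contribute less to the loss.
    """
    per_frame = np.abs(rendered - generated).mean(axis=(1, 2, 3))  # (F,)
    return float((confidence * per_frame).sum() / confidence.sum())

# Toy example: two 4x4 frames; the second frame is only half-trusted.
rng = np.random.default_rng(0)
rendered = rng.random((2, 4, 4, 3))
generated = rendered.copy()
generated[1] += 0.5  # constant error only in the low-confidence frame
loss = confidence_weighted_l1(rendered, generated, np.array([1.0, 0.5]))
# Frame 0 contributes 0 error; frame 1's error of 0.5 is halved, so the
# weighted loss is (1.0 * 0 + 0.5 * 0.5) / 1.5 ≈ 0.167.
```

In a real 3DGS loop this scalar would be minimized with respect to the Gaussian parameters; the point of the weighting is simply that unreliable generated frames cannot dominate the optimization.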
Why it matters?
This research is important because it allows for better reconstruction of 3D scenes even when only limited views are available. This can be useful in many fields, such as virtual reality, gaming, and architecture, where accurate 3D representations are essential for creating immersive experiences.
Abstract
Advancements in 3D scene reconstruction have transformed 2D images from the real world into 3D models, producing realistic 3D results from hundreds of input photos. Despite great success in dense-view reconstruction scenarios, rendering a detailed scene from insufficient captured views remains an ill-posed optimization problem, often resulting in artifacts and distortions in unseen areas. In this paper, we propose ReconX, a novel 3D scene reconstruction paradigm that reframes the ambiguous reconstruction challenge as a temporal generation task. The key insight is to unleash the strong generative prior of large pre-trained video diffusion models for sparse-view reconstruction. However, 3D view consistency is difficult to preserve in video frames generated directly from pre-trained models. To address this, given limited input views, the proposed ReconX first constructs a global point cloud and encodes it into a contextual space as the 3D structure condition. Guided by this condition, the video diffusion model then synthesizes video frames that are both detail-preserving and highly 3D-consistent, ensuring the coherence of the scene from various perspectives. Finally, we recover the 3D scene from the generated video through a confidence-aware 3D Gaussian Splatting optimization scheme. Extensive experiments on various real-world datasets show the superiority of our ReconX over state-of-the-art methods in terms of quality and generalizability.