To address this, VideoFrom3D proposes a generative framework that leverages the complementary strengths of image and video diffusion models. Specifically, the framework consists of a Sparse Anchor-view Generation (SAG) module and a Geometry-guided Generative Inbetweening (GGI) module. The SAG module generates high-quality, cross-view consistent anchor views using an image diffusion model, aided by Sparse Appearance-guided Sampling. Building on these anchor views, the GGI module faithfully interpolates the intermediate frames using a video diffusion model, enhanced by flow-based camera control and structural guidance.
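The two-stage data flow can be summarized as the minimal Python sketch below. All interfaces in it (`image_diffusion`, `video_diffusion`, `render_cues`, `select_anchor_cameras`, and the `Image`/`Camera` aliases) are hypothetical placeholders used only to illustrate how SAG and GGI fit together; the paper's actual conditioning signals and implementation differ.

```python
# Sketch of the two-stage VideoFrom3D pipeline (hypothetical API, not the authors' code).
from typing import Any, Callable, List, Sequence

Image = Any    # e.g. an HxWx3 array
Camera = Any   # a camera pose along the user-supplied trajectory


def select_anchor_cameras(trajectory: List[Camera], stride: int = 8) -> List[Camera]:
    """Pick a sparse subset of cameras along the trajectory as anchor views."""
    anchors = trajectory[::stride]
    if trajectory and anchors[-1] is not trajectory[-1]:
        anchors.append(trajectory[-1])  # always keep the final pose
    return anchors


def sparse_anchor_view_generation(
    scene_geometry: Any,
    anchor_cameras: Sequence[Camera],
    style_reference: Image,
    image_diffusion: Callable[..., Image],
    render_cues: Callable[[Any, Camera], Image],
) -> List[Image]:
    """SAG stage: generate cross-view consistent anchor images.

    Each anchor is conditioned on geometry cues rendered from the 3D scene
    (e.g. depth or edge maps) and the shared style reference; previously
    generated anchors are passed back in as appearance context, standing in
    for the paper's Sparse Appearance-guided Sampling.
    """
    anchor_images: List[Image] = []
    for cam in anchor_cameras:
        cues = render_cues(scene_geometry, cam)
        img = image_diffusion(cues=cues, style=style_reference, context=anchor_images)
        anchor_images.append(img)
    return anchor_images


def geometry_guided_inbetweening(
    anchor_images: Sequence[Image],
    anchor_cameras: Sequence[Camera],
    trajectory: List[Camera],
    video_diffusion: Callable[..., List[Image]],
) -> List[Image]:
    """GGI stage: interpolate intermediate frames between consecutive anchors.

    The video diffusion model is assumed to take the two bounding anchor
    frames plus the in-between camera poses (standing in for flow-based
    camera control and structural guidance) and return the full clip.
    """
    frames: List[Image] = []
    for k in range(len(anchor_images) - 1):
        i = trajectory.index(anchor_cameras[k])
        j = trajectory.index(anchor_cameras[k + 1])
        clip = video_diffusion(
            start=anchor_images[k],
            end=anchor_images[k + 1],
            cameras=trajectory[i : j + 1],
        )
        frames.extend(clip if not frames else clip[1:])  # avoid duplicating anchors
    return frames
```

In this sketch, the anchor views produced by SAG bound each clip that GGI fills in, so the final video is simply the concatenation of the interpolated clips along the camera trajectory.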
The synthesized video sequence shows consistent, high-quality visuals that reflect the input geometry and reference style, including challenging visual elements such as rising steam. Comprehensive experiments show that VideoFrom3D produces high-quality, style-consistent scene videos under diverse and challenging scenarios, outperforming simple and extended baselines. The framework operates without any paired dataset of 3D scene models and natural images, which would be extremely difficult to obtain.