DriveGen3D: Boosting Feed-Forward Driving Scene Generation with Efficient Video Diffusion
Weijie Wang, Jiagang Zhu, Zeyu Zhang, Xiaofeng Wang, Zheng Zhu, Guosheng Zhao, Chaojun Ni, Haoxiao Wang, Guan Huang, Xinze Chen, Yukun Zhou, Wenkang Qin, Duochao Shi, Haoyun Li, Guanghong Jia, Jiwen Lu
2025-10-20
Summary
This paper introduces DriveGen3D, a new system for creating realistic and controllable 3D driving scenes with accompanying videos.
What's the problem?
Currently, making long, detailed 3D driving simulations is really hard. Existing methods either take too long to compute, only generate videos without building an underlying 3D world, or can only recreate a single, static scene. No existing approach can quickly produce both a realistic video *and* a corresponding 3D environment that changes over time.
What's the solution?
The researchers developed a two-part system. First, 'FastDrive-DiT' quickly creates high-quality, temporally consistent videos of driving scenes, guided by text descriptions and a simple map-like layout (a bird's-eye view of the scene). Second, 'FastRecon3D' rapidly builds a 3D model of the scene as the video unfolds, making sure the 3D world matches what's happening in the video. Combined, the two components generate extended driving videos and matching 3D scenes in real time.
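To make the control flow concrete, here is a minimal PyTorch sketch of that two-stage hand-off. Everything below is an illustrative assumption: the class names mirror the paper's components, but the tensor shapes, module internals, and sampler are toy stand-ins, not the authors' implementation.

```python
import torch
import torch.nn as nn

class FastDriveDiT(nn.Module):
    """Toy stand-in for the video diffusion transformer (internals are illustrative)."""
    def __init__(self, frames=8, c=3, h=32, w=32, cond_dim=16):
        super().__init__()
        self.frames, self.c, self.h, self.w = frames, c, h, w
        d = c * h * w
        # A single linear layer stands in for the full DiT denoiser stack.
        self.denoiser = nn.Linear(d + cond_dim, d)

    @torch.no_grad()
    def generate(self, cond, steps=4):
        """cond: (cond_dim,) fused text + BEV-layout embedding (placeholder)."""
        d = self.c * self.h * self.w
        x = torch.randn(self.frames, d)              # start each frame from noise
        cond = cond.expand(self.frames, -1)          # share conditioning across frames
        for _ in range(steps):                       # few-step denoising loop
            x = x - 0.2 * self.denoiser(torch.cat([x, cond], dim=-1))
        return x.view(self.frames, self.c, self.h, self.w)

class FastRecon3D(nn.Module):
    """Toy feed-forward reconstructor: video frames -> per-frame 3D Gaussians."""
    def __init__(self, c=3, h=32, w=32, n_gaussians=256):
        super().__init__()
        # Per Gaussian: 3 mean + 3 scale + 4 rotation + 3 color + 1 opacity = 14 values.
        self.head = nn.Linear(c * h * w, n_gaussians * 14)
        self.n_gaussians = n_gaussians

    @torch.no_grad()
    def forward(self, frames):
        flat = frames.flatten(start_dim=1)           # (T, c*h*w)
        return self.head(flat).view(frames.shape[0], self.n_gaussians, 14)

# Stage 1 generates a clip; stage 2 lifts it to a time-indexed 3D representation.
generator, reconstructor = FastDriveDiT(), FastRecon3D()
cond = torch.randn(16)                               # placeholder text + BEV embedding
clip = generator.generate(cond)                      # (8, 3, 32, 32) frames
gaussians = reconstructor(clip)                      # (8, 256, 14) Gaussian params
print(clip.shape, gaussians.shape)
```

The design point this illustrates is the hand-off between stages: the reconstructor sees only the generated frames, not the diffusion model's internals, which is what lets the feed-forward 3D stage keep pace with the video stage.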
Why it matters?
This work is important because it allows for the creation of much more realistic and useful driving simulations. These simulations can be used for things like training self-driving cars, testing new road designs, or creating content for video games. Because the system runs efficiently, it also opens the door to longer and more complex simulations than were previously practical.
Abstract
We present DriveGen3D, a novel framework for generating high-quality and highly controllable dynamic 3D driving scenes that addresses critical limitations in existing methodologies. Current approaches to driving scene synthesis either suffer from prohibitive computational demands for extended temporal generation, focus exclusively on prolonged video synthesis without 3D representation, or restrict themselves to static single-scene reconstruction. Our work bridges this methodological gap by integrating accelerated long-term video generation with large-scale dynamic scene reconstruction through multimodal conditional control. DriveGen3D introduces a unified pipeline consisting of two specialized components: FastDrive-DiT, an efficient video diffusion transformer for high-resolution, temporally coherent video synthesis under text and Bird's-Eye-View (BEV) layout guidance; and FastRecon3D, a feed-forward reconstruction module that rapidly builds 3D Gaussian representations across time, ensuring spatial-temporal consistency. Together, these components enable real-time generation of extended driving videos (up to 424×800 at 12 FPS) and corresponding dynamic 3D scenes, achieving SSIM of 0.811 and PSNR of 22.84 on novel view synthesis, all while maintaining parameter efficiency.
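For reference, the SSIM and PSNR figures quoted above are standard image-fidelity metrics for novel view synthesis. The snippet below is a minimal sketch of how such scores are computed with scikit-image; the images here are random placeholders at the paper's stated resolution, not the paper's data or results.

```python
import numpy as np
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

# Random placeholder frames at 424x800 resolution, values in [0, 1].
rng = np.random.default_rng(0)
ground_truth = rng.random((424, 800, 3)).astype(np.float32)
rendered = np.clip(
    ground_truth + 0.05 * rng.standard_normal(ground_truth.shape).astype(np.float32),
    0.0, 1.0,
)

# PSNR measures per-pixel error; SSIM measures similarity of local structure.
psnr = peak_signal_noise_ratio(ground_truth, rendered, data_range=1.0)
ssim = structural_similarity(ground_truth, rendered, data_range=1.0, channel_axis=-1)
print(f"PSNR: {psnr:.2f} dB  SSIM: {ssim:.3f}")
```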