MagicDriveDiT: High-Resolution Long Video Generation for Autonomous Driving with Adaptive Control
Ruiyuan Gao, Kai Chen, Bo Xiao, Lanqing Hong, Zhenguo Li, Qiang Xu
2024-11-22

Summary
This paper presents MagicDriveDiT, a new method for generating high-resolution, long videos designed specifically for autonomous driving applications, built on advanced video synthesis techniques.
What's the problem?
Current video generation methods often struggle to produce the high-quality, long videos that tasks like autonomous driving require. They also scale poorly and integrate control conditions ineffectively, which limits their usefulness in real-world scenarios where precise control is needed.
What's the solution?
MagicDriveDiT addresses these challenges by using a novel approach based on the DiT architecture. It enhances scalability through flow matching and employs a progressive training strategy to handle complex scenarios better. The model incorporates spatial-temporal conditional encoding, allowing for precise control over how objects move and interact in the generated videos. This results in the ability to create realistic street scene videos with higher resolution and more frames than previous methods.
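To make the flow-matching idea mentioned above concrete, here is a minimal sketch of one training step under a common formulation (a straight-line path from noise to data with a velocity-regression target, as in rectified-flow-style models). The function and argument names are illustrative, not taken from the paper's code:

```python
import numpy as np

def flow_matching_loss(velocity_model, x1, rng):
    """Illustrative flow-matching training step.

    x1: a batch of clean (video) latents, shape (batch, ...).
    velocity_model: callable(xt, t) -> predicted velocity, same shape as xt.
    """
    x0 = rng.standard_normal(x1.shape)                      # noise endpoint
    t = rng.random((x1.shape[0],) + (1,) * (x1.ndim - 1))   # per-sample time in [0, 1]
    xt = (1 - t) * x0 + t * x1                              # point on the straight path
    target = x1 - x0                                        # velocity of the linear path
    pred = velocity_model(xt, t)                            # model's velocity prediction
    return np.mean((pred - target) ** 2)                    # simple regression objective
```

Compared with standard diffusion training, this objective regresses a velocity field along a simple interpolation path, which is one reason flow matching is often reported to scale and train more stably.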
Why it matters?
This research is significant because it improves the technology used for generating videos in autonomous driving applications, which require detailed and accurate representations of road environments. By enhancing video generation quality and control, MagicDriveDiT can help develop safer and more reliable autonomous vehicles, ultimately contributing to advancements in transportation technology.
Abstract
The rapid advancement of diffusion models has greatly improved video synthesis, especially in controllable video generation, which is essential for applications like autonomous driving. However, existing methods are limited by scalability and how control conditions are integrated, failing to meet the needs for high-resolution and long videos for autonomous driving applications. In this paper, we introduce MagicDriveDiT, a novel approach based on the DiT architecture, to tackle these challenges. Our method enhances scalability through flow matching and employs a progressive training strategy to manage complex scenarios. By incorporating spatial-temporal conditional encoding, MagicDriveDiT achieves precise control over spatial-temporal latents. Comprehensive experiments show its superior performance in generating realistic street scene videos with higher resolution and more frames. MagicDriveDiT significantly improves video generation quality and spatial-temporal controls, expanding its potential applications across various tasks in autonomous driving.