MaskGWM: A Generalizable Driving World Model with Video Mask Reconstruction
Jingcheng Ni, Yuxin Guo, Yichen Liu, Rui Chen, Lewei Lu, Zehuan Wu
2025-02-24
Summary
This paper introduces MaskGWM, a new AI system for self-driving cars that can better predict and understand the road environment over longer time horizons and from multiple viewpoints.
What's the problem?
Current AI models for self-driving cars are good at making short-term predictions about the road, but they struggle to understand what might happen further into the future or in different situations they haven't seen before. It's like having a driver who can only see a few feet ahead and gets confused on unfamiliar roads.
What's the solution?
The researchers created MaskGWM, which uses a special technique called 'mask reconstruction' to help the AI learn better. They made the AI fill in missing parts of road scenes, kind of like solving a puzzle, which helps it understand the whole picture better. They also made two versions: one that's good at long-term predictions and another that can understand the road from different angles.
Why does it matter?
This matters because it could make self-driving cars much safer and more reliable. If cars can predict what might happen further ahead and understand the road from all angles, they can make better decisions and handle unexpected situations more safely. This could bring us closer to having truly autonomous vehicles that can drive in all kinds of conditions, potentially reducing accidents and making transportation more efficient for everyone.
Abstract
World models that forecast environmental changes from actions are vital for autonomous driving models with strong generalization. Prevailing driving world models are mainly built on video prediction models. Although these models can produce high-fidelity video sequences with advanced diffusion-based generators, they are constrained in their predictive duration and overall generalization capability. In this paper, we explore solving this problem by combining the generation loss with MAE-style feature-level context learning. In particular, we instantiate this target with three key designs: (1) a more scalable Diffusion Transformer (DiT) structure trained with an extra mask reconstruction task; (2) diffusion-related mask tokens that handle the fuzzy relation between mask reconstruction and the generative diffusion process; (3) an extension of the mask reconstruction task to the spatial-temporal domain, using row-wise masking with shifted self-attention rather than the masked self-attention of MAE. We then adopt a row-wise cross-view module to align with this mask design. Based on these improvements, we propose MaskGWM: a Generalizable driving World Model embodied with Video Mask reconstruction. Our model has two variants: MaskGWM-long, focusing on long-horizon prediction, and MaskGWM-mview, dedicated to multi-view generation. Comprehensive experiments on standard benchmarks validate the effectiveness of the proposed method, including standard validation on the nuScenes dataset, long-horizon rollout on the OpenDV-2K dataset, and zero-shot validation on the Waymo dataset. Quantitative metrics on these datasets show that our method notably improves on state-of-the-art driving world models.
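The abstract's core idea of row-wise mask reconstruction can be illustrated with a minimal sketch. The class below is a hypothetical simplification (the real MaskGWM design, including shifted self-attention and the diffusion-conditioned mask tokens, is not reproduced here): it masks whole rows of a frame's patch tokens and replaces them with a learnable mask token, which a DiT-style backbone would then be trained to reconstruct. All names (`RowWiseMasking`, `mask_ratio`) are assumptions for illustration, not the authors' API.

```python
import torch
import torch.nn as nn


class RowWiseMasking(nn.Module):
    """Hypothetical sketch of row-wise mask reconstruction:
    mask whole rows of spatial tokens and substitute a learnable
    mask token before the transformer processes the sequence."""

    def __init__(self, embed_dim: int, mask_ratio: float = 0.5):
        super().__init__()
        # learnable placeholder inserted at masked positions (zeros at init)
        self.mask_token = nn.Parameter(torch.zeros(embed_dim))
        self.mask_ratio = mask_ratio

    def forward(self, tokens: torch.Tensor, h: int, w: int):
        # tokens: (batch, h*w, embed_dim) patch embeddings of one frame
        b, n, d = tokens.shape
        assert n == h * w
        x = tokens.view(b, h, w, d)

        # pick a random subset of rows to mask, independently per sample
        num_masked = int(h * self.mask_ratio)
        perm = torch.rand(b, h, device=tokens.device).argsort(dim=1)
        row_mask = torch.zeros(b, h, dtype=torch.bool, device=tokens.device)
        row_mask.scatter_(1, perm[:, :num_masked], True)  # (b, h)

        # broadcast the mask token over every position in a masked row
        x = torch.where(row_mask[:, :, None, None],
                        self.mask_token.view(1, 1, 1, d),
                        x)
        return x.view(b, n, d), row_mask
```

A reconstruction loss (e.g. MSE against the original tokens at masked rows) would then be added to the diffusion generation loss, which is the combination the abstract describes.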