DrivingGen: A Comprehensive Benchmark for Generative Video World Models in Autonomous Driving

Yang Zhou, Hao Shao, Letian Wang, Zhuofan Zong, Hongsheng Li, Steven L. Waslander

2026-01-13

Summary

This paper introduces a new way to test and measure how well AI models can predict what will happen in future driving scenarios, essentially letting them 'imagine' the road ahead.

What's the problem?

Currently, there isn't a good, all-encompassing way to judge how effective these 'driving world models' are. Existing tests focus on things like how realistic the video looks, but miss crucial aspects such as whether the predicted movements of cars obey the laws of physics, or whether the simulation actually responds to the driver's control inputs. On top of that, the data used to train and test these models doesn't represent the wide variety of real-world driving conditions.

What's the solution?

The researchers created a new benchmark called DrivingGen. It includes a large and diverse dataset of driving scenes, curated from both driving datasets and internet videos, covering different weather conditions, times of day, and locations. They also developed a set of new tests that look at not just visual realism, but also how believable the predicted paths of vehicles are, how consistent the simulation stays over time, and how well the generated video follows the driver's control inputs. They then evaluated 14 existing AI models using this new benchmark.
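To make the 'believable paths' idea concrete, here is a minimal sketch of the kind of kinematic check a trajectory-plausibility metric could apply: given a predicted 2D vehicle trajectory, it estimates acceleration and jerk by finite differences and scores the fraction of frames that stay within physical limits. The function name, limit values, and scoring rule are illustrative assumptions, not DrivingGen's actual metric.

```python
import numpy as np

def kinematic_plausibility(traj_xy, fps, max_accel=8.0, max_jerk=15.0):
    """Score how physically plausible a predicted vehicle trajectory is.

    traj_xy : (T, 2) array of predicted positions in meters.
    fps     : frame rate of the generated video.
    Limits (m/s^2, m/s^3) are illustrative, not DrivingGen's values.
    Returns the fraction of frames within both limits (1.0 = fully plausible).
    """
    dt = 1.0 / fps
    vel = np.diff(traj_xy, axis=0) / dt    # per-frame velocity, (T-1, 2)
    acc = np.diff(vel, axis=0) / dt        # acceleration, (T-2, 2)
    jerk = np.diff(acc, axis=0) / dt       # jerk, (T-3, 2)

    acc_ok = np.linalg.norm(acc, axis=1) <= max_accel
    jerk_ok = np.linalg.norm(jerk, axis=1) <= max_jerk
    n = min(len(acc_ok), len(jerk_ok))
    if n == 0:
        return 1.0  # trajectory too short to judge
    return float((acc_ok[:n] & jerk_ok[:n]).mean())

# A car cruising at constant speed passes; one that "teleports" sideways fails.
smooth = np.stack([np.linspace(0.0, 30.0, 31), np.zeros(31)], axis=1)
teleport = smooth.copy()
teleport[15:, 1] += 5.0                    # sudden 5 m lateral jump in one frame
print(kinematic_plausibility(smooth, fps=10))    # 1.0
print(kinematic_plausibility(teleport, fps=10))  # well below 1.0
```

A real metric would operate on trajectories extracted from the generated video (for example, via tracking) and would likely use learned or calibrated limits, but the core idea is the same: physics violations such as teleporting vehicles should be detected and penalized.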

Why it matters?

Having a reliable benchmark like DrivingGen is important because it allows researchers to accurately compare different models, identify their strengths and weaknesses, and ultimately build safer and more effective AI systems for autonomous driving. This will help with things like testing self-driving cars in simulated environments and generating data to train these systems without needing to rely solely on expensive and potentially dangerous real-world driving.

Abstract

Video generation models, as one form of world models, have emerged as one of the most exciting frontiers in AI, promising agents the ability to imagine the future by modeling the temporal evolution of complex scenes. In autonomous driving, this vision gives rise to driving world models: generative simulators that imagine ego and agent futures, enabling scalable simulation, safe testing of corner cases, and rich synthetic data generation. Yet, despite fast-growing research activity, the field lacks a rigorous benchmark to measure progress and guide priorities. Existing evaluations remain limited: generic video metrics overlook safety-critical imaging factors; trajectory plausibility is rarely quantified; temporal and agent-level consistency is neglected; and controllability with respect to ego conditioning is ignored. Moreover, current datasets fail to cover the diversity of conditions required for real-world deployment. To address these gaps, we present DrivingGen, the first comprehensive benchmark for generative driving world models. DrivingGen combines a diverse evaluation dataset curated from both driving datasets and internet-scale video sources, spanning varied weather, time of day, geographic regions, and complex maneuvers, with a suite of new metrics that jointly assess visual realism, trajectory plausibility, temporal coherence, and controllability. Benchmarking 14 state-of-the-art models reveals clear trade-offs: general models look better but break physics, while driving-specific ones capture motion realistically but lag in visual quality. DrivingGen offers a unified evaluation framework to foster reliable, controllable, and deployable driving world models, enabling scalable simulation, planning, and data-driven decision-making.
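As one illustration of the 'temporal coherence' axis the abstract mentions, the sketch below scores frame-to-frame consistency in a generated clip: it average-pools each frame and measures the mean absolute change between consecutive pooled frames, so a flickering or abruptly changing clip scores worse. This is a simplified pixel-space stand-in for the learned-feature metrics a benchmark like this would actually use; the function name and pooling factor are assumptions for illustration.

```python
import numpy as np

def temporal_flicker(frames, pool=8):
    """Rough temporal-coherence proxy for a generated video clip.

    frames : (T, H, W, 3) uint8 array of RGB frames.
    pool   : spatial pooling factor; averaging over pool x pool blocks
             tolerates genuine small motion but exposes global flicker.
    Returns the mean absolute change between consecutive pooled frames
    (lower = more temporally coherent). Not DrivingGen's actual metric.
    """
    T, H, W, _ = frames.shape
    h, w = H // pool, W // pool
    gray = frames.astype(np.float32).mean(axis=-1)           # (T, H, W)
    pooled = gray[:, :h * pool, :w * pool]                   # crop to multiple of pool
    pooled = pooled.reshape(T, h, pool, w, pool).mean(axis=(2, 4))  # (T, h, w)
    return float(np.abs(np.diff(pooled, axis=0)).mean())

# Example: a perfectly static clip versus pure noise.
static = np.full((16, 64, 64, 3), 128, dtype=np.uint8)
noise = np.random.randint(0, 256, (16, 64, 64, 3), dtype=np.uint8)
print(temporal_flicker(static))  # 0.0
print(temporal_flicker(noise))   # much larger
```

Pooling before differencing is a deliberate choice here: it keeps legitimate object motion from dominating the score while still catching the global flicker and sudden content changes that make a clip incoherent over time.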