UniUGP: Unifying Understanding, Generation, and Planning for End-to-end Autonomous Driving
Hao Lu, Ziyang Liu, Guangfeng Jiang, Yuanfei Luo, Sheng Chen, Yangang Zhang, Ying-Cong Chen
2025-12-11
Summary
This paper tackles the challenge of making self-driving cars better at handling unusual and difficult situations, often called 'long-tail scenarios'. It introduces a new system that combines understanding what's happening in a scene, predicting what will happen next, and then planning a safe path for the car.
What's the problem?
Self-driving systems currently struggle when they encounter situations they haven't been specifically programmed for. They lack a broad understanding of the world and aren't great at predicting how things will change over time, especially visually. Existing methods either can't learn from unlabeled video footage to understand cause and effect, or they don't use the powerful reasoning abilities of large language models to make smart decisions.
What's the solution?
The researchers created new datasets with detailed information about complex driving scenarios, including reasoning and planning steps. They then developed a system called UniUGP, which works in three main parts: understanding the current scene, generating possible future scenarios as videos, and planning a safe route. UniUGP uses pre-trained models that are good at both understanding language and creating realistic videos, allowing it to predict how the world will change and make better driving decisions. It's trained in stages, gradually learning to perform each of these tasks across several existing and newly created datasets.
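The three-part pipeline described above can be sketched at a high level: an understanding module produces a reasoning trace, a generation module predicts future frames, and a planner consumes both. This is a minimal conceptual sketch with stand-in classes; all names, signatures, and behaviors here are hypothetical illustrations, not the authors' implementation.

```python
# Conceptual sketch of a unified Understanding-Generation-Planning pipeline.
# All class and function names are hypothetical, not taken from the paper.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class DrivingOutput:
    reasoning: str             # chain-of-thought scene explanation
    future_frames: List[str]   # placeholders for generated future video frames
    trajectory: List[Tuple[float, float]]  # planned (x, y) waypoints


class UnderstandingExpert:
    """Stand-in for a pre-trained VLM reasoning over multi-frame input."""

    def reason(self, frames: List[str], instruction: str) -> str:
        return f"Observed {len(frames)} frames; instruction: {instruction}"


class GenerationExpert:
    """Stand-in for a video model that rolls the scene forward in time."""

    def generate(self, frames: List[str], horizon: int) -> List[str]:
        return [f"pred_frame_{t}" for t in range(horizon)]


class PlanningExpert:
    """Stand-in planner conditioned on reasoning and predicted futures."""

    def plan(self, reasoning: str, future: List[str]) -> List[Tuple[float, float]]:
        # Trivial straight-line trajectory: one waypoint per predicted frame.
        return [(float(t), 0.0) for t in range(len(future))]


class UniUGPPipeline:
    """Wires the three experts: understand -> generate -> plan."""

    def __init__(self) -> None:
        self.understanding = UnderstandingExpert()
        self.generation = GenerationExpert()
        self.planning = PlanningExpert()

    def __call__(self, frames: List[str], instruction: str,
                 horizon: int = 4) -> DrivingOutput:
        reasoning = self.understanding.reason(frames, instruction)
        future = self.generation.generate(frames, horizon)
        trajectory = self.planning.plan(reasoning, future)
        return DrivingOutput(reasoning, future, trajectory)
```

The key design point the sketch illustrates is that the planner sees both the semantic reasoning and the predicted visual future, rather than planning from raw observations alone.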
Why it matters?
This work is important because it moves self-driving cars closer to being able to handle the unpredictable nature of real-world driving. By improving their ability to reason about situations and predict what might happen, these systems can become safer and more reliable, especially in those challenging 'long-tail' scenarios that are currently a major hurdle for the technology.
Abstract
Autonomous driving (AD) systems struggle in long-tail scenarios due to limited world knowledge and weak visual dynamic modeling. Existing vision-language-action (VLA)-based methods cannot leverage unlabeled videos for visual causal learning, while world model-based methods lack reasoning capabilities from large language models. In this paper, we construct multiple specialized datasets providing reasoning and planning annotations for complex scenarios. Then, a unified Understanding-Generation-Planning framework, named UniUGP, is proposed to synergize scene reasoning, future video generation, and trajectory planning through a hybrid expert architecture. By integrating pre-trained VLMs and video generation models, UniUGP leverages visual dynamics and semantic reasoning to enhance planning performance. Taking multi-frame observations and language instructions as input, it produces interpretable chain-of-thought reasoning, physically consistent trajectories, and coherent future videos. We introduce a four-stage training strategy that progressively builds these capabilities across multiple existing AD datasets, along with the proposed specialized datasets. Experiments demonstrate state-of-the-art performance in perception, reasoning, and decision-making, with superior generalization to challenging long-tail situations.
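The abstract's four-stage training strategy, which progressively builds understanding, generation, and planning capabilities, can be expressed as a simple staged schedule. The stage names and the exact module groupings below are assumptions for illustration only; the paper specifies just that capabilities are added progressively.

```python
# Hypothetical sketch of a progressive four-stage training schedule.
# Stage names and module groupings are illustrative, not from the paper.

from typing import List, Set, Tuple

STAGES: List[Tuple[str, Set[str]]] = [
    ("stage 1: understanding alignment", {"understanding"}),
    ("stage 2: future video generation", {"understanding", "generation"}),
    ("stage 3: trajectory planning",     {"understanding", "generation", "planning"}),
    ("stage 4: joint fine-tuning",       {"understanding", "generation", "planning"}),
]


def trainable_modules(stage_index: int) -> Set[str]:
    """Return which expert modules are trained at a given stage (0-based)."""
    _, modules = STAGES[stage_index]
    return modules
```

A curriculum like this lets each expert stabilize before the hybrid architecture is optimized jointly, which is one common motivation for staged training.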