DrivePI: Spatial-aware 4D MLLM for Unified Autonomous Driving Understanding, Perception, Prediction and Planning

Zhe Liu, Runhui Huang, Rui Yang, Siming Yan, Zining Wang, Lu Hou, Di Lin, Xiang Bai, Hengshuang Zhao

2025-12-16

Summary

This paper introduces DrivePI, a new artificial intelligence system designed to help self-driving cars understand their surroundings and make decisions. It combines vision, language, and action planning into one system, allowing it to perceive the 3D world as it changes over time (the "4D" in its name), predict how things will move, and plan a safe path forward.

What's the problem?

Current self-driving systems often treat understanding the 3D environment, predicting future movement, and planning actions as separate tasks. Existing systems that *do* combine these tasks aren't very good at all of them simultaneously, or they require very large and complex models. There's a need for a more unified and efficient approach to help self-driving cars navigate complex situations.

What's the solution?

The researchers created DrivePI, a single model that handles all three tasks at the same time: understanding the 3D space around the car, predicting where objects will move, and deciding what the car should do. It combines different types of input data, including point clouds (3D measurements from lidar sensors), multi-view camera images, and text instructions, and is built on a small 0.5B-parameter language model. They also created a data engine to generate training data specifically for this combined approach, helping the model learn to connect language with spatial understanding and prediction.
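To make the "one model, three outputs" idea concrete, here is a minimal sketch of what such a unified interface could look like. This is purely illustrative: the function name, tensor shapes, and grid sizes are assumptions, not taken from the DrivePI codebase.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class DriveOutputs:
    """Hypothetical container for the three parallel outputs."""
    occupancy: np.ndarray  # 3D occupancy grid of class labels, shape (X, Y, Z)
    flow: np.ndarray       # per-voxel motion vectors, shape (X, Y, Z, 3)
    plan: np.ndarray       # future ego waypoints in meters, shape (T, 2)

def unified_forward(point_cloud, images, instruction):
    """Illustrative stand-in for a unified forward pass (not DrivePI's code).

    The real model fuses lidar points, multi-view camera images, and a text
    instruction inside one MLLM; here we only return zero tensors with
    plausible shapes to show the shared three-task interface.
    """
    X, Y, Z, T = 200, 200, 16, 6  # assumed grid resolution and planning horizon
    occupancy = np.zeros((X, Y, Z), dtype=np.int64)
    flow = np.zeros((X, Y, Z, 3), dtype=np.float32)
    plan = np.zeros((T, 2), dtype=np.float32)
    return DriveOutputs(occupancy, flow, plan)

out = unified_forward(
    point_cloud=np.random.rand(10000, 4),   # x, y, z, intensity per point
    images=np.random.rand(6, 3, 224, 224),  # six surround-view cameras
    instruction="drive straight and yield to pedestrians",
)
print(out.plan.shape)  # (6, 2)
```

The key design point the paper emphasizes is that all three outputs come from one end-to-end-optimized model rather than three separate networks.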

Why it matters?

DrivePI is important because it shows that a single, relatively small AI model can perform all the key tasks needed for self-driving cars at a level comparable to or better than existing, more complex systems. This could lead to more efficient and safer self-driving technology, potentially making it more affordable and accessible. It demonstrates a promising direction for building more integrated and capable autonomous driving systems.

Abstract

Although multi-modal large language models (MLLMs) have shown strong capabilities across diverse domains, their application in generating fine-grained 3D perception and prediction outputs in autonomous driving remains underexplored. In this paper, we propose DrivePI, a novel spatial-aware 4D MLLM that serves as a unified Vision-Language-Action (VLA) framework that is also compatible with vision-action (VA) models. Our method jointly performs spatial understanding, 3D perception (i.e., 3D occupancy), prediction (i.e., occupancy flow), and planning (i.e., action outputs) in parallel through end-to-end optimization. To obtain both precise geometric information and rich visual appearance, our approach integrates point clouds, multi-view images, and language instructions within a unified MLLM architecture. We further develop a data engine to generate text-occupancy and text-flow QA pairs for 4D spatial understanding. Remarkably, with only a 0.5B Qwen2.5 model as MLLM backbone, DrivePI as a single unified model matches or exceeds both existing VLA models and specialized VA models. Specifically, compared to VLA models, DrivePI outperforms OpenDriveVLA-7B by 2.5% mean accuracy on nuScenes-QA and reduces collision rate by 70% over ORION (from 0.37% to 0.11%) on nuScenes. Against specialized VA models, DrivePI surpasses FB-OCC by 10.3 RayIoU for 3D occupancy on OpenOcc, reduces the mAVE from 0.591 to 0.509 for occupancy flow on OpenOcc, and achieves 32% lower L2 error than VAD (from 0.72m to 0.49m) for planning on nuScenes. Code will be available at https://github.com/happinesslz/DrivePI
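As a quick sanity check, the headline percentages quoted in the abstract follow directly from the reported numbers as relative reductions:

```python
def relative_reduction(before, after):
    """Percent reduction from `before` to `after`."""
    return (before - after) / before * 100

# Collision rate vs. ORION on nuScenes: 0.37% -> 0.11%
print(round(relative_reduction(0.37, 0.11)))  # 70
# Planning L2 error vs. VAD on nuScenes: 0.72 m -> 0.49 m
print(round(relative_reduction(0.72, 0.49)))  # 32
```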