A Survey on Vision-Language-Action Models for Autonomous Driving
Sicong Jiang, Zilin Huang, Kangan Qian, Ziang Luo, Tianze Zhu, Yang Zhong, Yihong Tang, Menglin Kong, Yunlong Wang, Siwen Jiao, Hao Ye, Zihao Sheng, Xin Zhao, Tuopu Wen, Zheng Fu, Sikai Chen, Kun Jiang, Diange Yang, Seongjin Choi, Lijun Sun
2025-07-10
Summary
This paper surveys Vision-Language-Action (VLA) models, advanced AI systems designed for self-driving cars. These models combine the car's ability to see its environment, understand language instructions, and make driving decisions within a single system.
What's the problem?
The problem is that traditional self-driving systems either split driving into separate modules (such as perception, prediction, and planning), which adds complexity and lets errors propagate between stages, or use simple end-to-end models that don't explain their decisions and struggle in unfamiliar situations. There is a gap in creating models that can both reason about complex scenes and produce clear, explainable driving decisions.
What's the solution?
The researchers review different VLA models that combine vision, language, and action in a unified system. These models use language as a way to 'think' about what the car sees and to decide what to do, improving both scene understanding and vehicle control. The paper also traces how these models have evolved, discusses open challenges such as robustness and real-time performance, and describes how they are evaluated on real driving data and benchmarks.
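To make the "unified system" idea concrete, here is a minimal, purely illustrative Python sketch of how a VLA-style policy might couple perception, language-based reasoning, and action in one loop. The class and method names (ToyVLAPolicy, perceive, reason, act) are hypothetical placeholders, not taken from the survey or any specific model it covers.

```python
# Illustrative sketch of a VLA-style driving loop: camera frames and an
# instruction go in, a language "thought" and a low-level action come out.
# All names here are hypothetical, for explanation only.

from dataclasses import dataclass
from typing import List


@dataclass
class Action:
    steering: float      # radians; positive = turn left
    acceleration: float  # m/s^2; negative = braking


class ToyVLAPolicy:
    """Stand-in for a unified vision-language-action model."""

    def perceive(self, camera_frames: List[str]) -> str:
        # A real model would run a vision encoder; here we just summarize inputs.
        return f"scene observed from {len(camera_frames)} camera views"

    def reason(self, scene: str, instruction: str) -> str:
        # Language serves as the intermediate "thinking" space.
        return (f"Given {scene} and the instruction '{instruction}', "
                f"slow down and keep the current lane.")

    def act(self, thought: str) -> Action:
        # A real action decoder maps the reasoning to trajectories or controls.
        braking = "slow down" in thought
        return Action(steering=0.0, acceleration=-1.0 if braking else 0.5)


if __name__ == "__main__":
    policy = ToyVLAPolicy()
    scene = policy.perceive(["front_cam.png", "rear_cam.png"])
    thought = policy.reason(scene, "yield to the pedestrian ahead")
    action = policy.act(thought)
    print(thought)
    print(action)
```

The point of the sketch is only the data flow: perception produces a scene description, language-level reasoning turns that description plus an instruction into an interpretable intermediate "thought", and the action step grounds that thought in concrete vehicle controls.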
Why it matters?
This matters because improving VLA models helps make autonomous vehicles safer, smarter, and more reliable by enabling them to understand complex driving situations better and follow human instructions more accurately. It also allows these systems to explain their choices, increasing trust and transparency.
Abstract
This survey provides a comprehensive overview of Vision-Language-Action (VLA) models in autonomous driving, detailing their architecture, evolution, and challenges.