An Anatomy of Vision-Language-Action Models: From Modules to Milestones and Challenges
Chao Xu, Suyu Zhang, Yang Liu, Baigui Sun, Weihong Chen, Bo Xu, Qi Liu, Juncheng Wang, Shujun Wang, Shan Luo, Jan Peters, Athanasios V. Vasilakos, Stefanos Zafeiriou, Jiankang Deng
2025-12-22
Summary
This paper is a comprehensive overview of Vision-Language-Action (VLA) models, a new class of AI systems that let robots understand human instructions and carry out tasks in the real world. It's like a guidebook to this rapidly growing field.
What's the problem?
The field of VLA models is developing incredibly quickly, with lots of new research and data appearing all the time. This makes it hard for researchers, especially those new to the area, to understand what's been done, what the key issues are, and where the field is heading. Essentially, it's a problem of keeping up with a fast-moving target.
What's the solution?
The authors created a structured survey of VLA models. They break these models down into their basic building blocks and then trace the field's key developments. Most importantly, they identify and analyze in depth the five biggest challenges currently facing VLA research: how to best represent information, how to reliably execute actions, how to make models generalize to new situations, how to ensure safety, and how to create good datasets for training and evaluation. They also maintain a website with ongoing updates to the survey.
Why it matters?
This paper is important because it provides a central resource for anyone working with or learning about VLA models. It helps newcomers get up to speed quickly and gives experienced researchers a roadmap for future work. By clearly outlining the key challenges, it aims to focus research efforts and accelerate progress towards creating truly intelligent and helpful robots.
Abstract
Vision-Language-Action (VLA) models are driving a revolution in robotics, enabling machines to understand instructions and interact with the physical world. The field is expanding rapidly, with new models and datasets appearing constantly, making it both exciting and challenging to keep pace with. This survey offers a clear and structured guide to the VLA landscape. We design it to follow the natural learning path of a researcher: we start with the basic Modules of any VLA model, trace the history through key Milestones, and then dive deep into the core Challenges that define the current research frontier. Our main contribution is a detailed breakdown of the five biggest challenges: (1) Representation, (2) Execution, (3) Generalization, (4) Safety, and (5) Dataset and Evaluation. This structure mirrors the developmental roadmap of a generalist agent: establishing the fundamental perception-action loop, scaling capabilities across diverse embodiments and environments, and finally ensuring trustworthy deployment, all supported by the essential data infrastructure. For each challenge, we review existing approaches and highlight future opportunities. We position this paper as both a foundational guide for newcomers and a strategic roadmap for experienced researchers, with the dual aim of accelerating learning and inspiring new ideas in embodied intelligence. A live version of this survey, with continuous updates, is maintained on our project page: https://suyuz1.github.io/Survery/