10 Open Challenges Steering the Future of Vision-Language-Action Models
Soujanya Poria, Navonil Majumder, Chia-Yu Hung, Amir Ali Bagherzadeh, Chuan Li, Kenneth Kwok, Ziwei Wang, Cheston Tan, Jiajun Wu, David Hsu
2025-11-11
Summary
This paper reviews recent progress and future directions in vision-language-action (VLA) models: AI systems that understand both images and language and use that understanding to perform actions in the real world.
What's the problem?
Creating AI that can truly understand and interact with the world the way humans do is really hard. Existing AI often struggles to connect what it 'sees' (vision) and what it 'reads' or is 'told' (language) with what it should actually *do* (action). The paper identifies ten key challenges in building these VLA models, ranging from making them understand complex instructions to ensuring they are safe and can work alongside people.
What's the solution?
The authors don't present a single new solution; instead, they analyze ten important areas where progress is needed in VLA models. These areas include improving how the AI processes different types of information (multimodality), making it better at logical thinking (reasoning), and finding ways to train the AI with less data (efficiency). They also discuss emerging trends such as helping the AI understand space and how the world changes over time, improving models after they've already been trained (post-training), and generating synthetic training data.
Why it matters?
This work is important because it provides a roadmap for researchers working on embodied AI, that is, AI that operates in the physical world. By highlighting these key challenges and emerging trends, the paper helps focus research efforts and accelerate the development of more capable and useful AI assistants and robots that can understand our instructions and help us with real-world tasks.
Abstract
Due to their ability to follow natural language instructions, vision-language-action (VLA) models are increasingly prevalent in the embodied AI arena, following the widespread success of their precursors -- LLMs and VLMs. In this paper, we discuss 10 principal milestones in the ongoing development of VLA models -- multimodality, reasoning, data, evaluation, cross-robot action generalization, efficiency, whole-body coordination, safety, agents, and coordination with humans. Furthermore, we discuss the emerging trends of spatial understanding, modeling world dynamics, post-training, and data synthesis -- all aimed at reaching these milestones. Through these discussions, we hope to draw attention to the research avenues that may accelerate the development of VLA models toward wider adoption.