A Survey on Vision-Language-Action Models: An Action Tokenization Perspective
Yifan Zhong, Fengshuo Bai, Shaofei Cai, Xuchuan Huang, Zhang Chen, Xiaowei Zhang, Yuanfei Wang, Shaoyang Guo, Tianrui Guan, Ka Nam Lui, Zhiquan Qi, Yitao Liang, Yuanpei Chen, Yaodong Yang
2025-07-03
Summary
This paper surveys vision-language-action (VLA) models, AI systems that combine what they see, what they understand from language, and how they act, all within one model. These models let robots or AI agents perform tasks by interpreting images and instructions and then deciding how to act.
What's the problem?
The problem is that earlier AI systems usually handled vision, language, and action separately, making it hard for robots to understand instructions and act smoothly in the real world. Integrating all of these abilities into one system is challenging but necessary for smarter robots.
What's the solution?
The researchers surveyed many VLA models and categorized them by how they represent actions as tokens the model can process, alongside the visual and language inputs. They analyzed the different designs, training methods, and ways these models predict actions from visual and language inputs, identifying what each tokenization choice does well and where it falls short.
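To make the idea of action tokenization concrete, here is a minimal sketch of one common scheme: each dimension of a continuous robot action is uniformly binned into discrete tokens that a language-model-style policy can emit. The bin count, action range, and 7-DoF example below are illustrative assumptions, not details taken from any specific model in the survey.

```python
import numpy as np

# Illustrative values (assumed, not from the survey): 256 bins over a
# normalized action range of [-1, 1] per dimension.
NUM_BINS = 256
ACTION_LOW, ACTION_HIGH = -1.0, 1.0

def tokenize_action(action: np.ndarray) -> np.ndarray:
    """Map each continuous action dimension to an integer token in [0, NUM_BINS)."""
    clipped = np.clip(action, ACTION_LOW, ACTION_HIGH)
    normalized = (clipped - ACTION_LOW) / (ACTION_HIGH - ACTION_LOW)
    return np.minimum((normalized * NUM_BINS).astype(int), NUM_BINS - 1)

def detokenize_action(tokens: np.ndarray) -> np.ndarray:
    """Recover an approximate continuous action from the bin centers."""
    centers = (tokens + 0.5) / NUM_BINS
    return centers * (ACTION_HIGH - ACTION_LOW) + ACTION_LOW

# Example: a 7-DoF action (xyz translation delta, rotation delta, gripper command).
action = np.array([0.12, -0.40, 0.05, 0.0, 0.25, -0.10, 1.0])
tokens = tokenize_action(action)       # discrete tokens the policy would predict
recovered = detokenize_action(tokens)  # continuous action sent to the robot
```

Other tokenization choices covered by this kind of survey (e.g., language descriptions, code, trajectories, or latent representations of actions) trade off precision, interpretability, and compatibility with pretrained language models in different ways.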
Why it matters?
This matters because combining vision, language, and action in one system lets robots and AI assistants understand complex tasks more like humans do. It helps create better robots that can learn from people, follow natural-language instructions, and act more effectively in real-world environments.
Abstract
This survey categorizes and analyzes vision-language-action models through the lens of action tokenization, identifying strengths, limitations, and future directions.