A Survey on Efficient Vision-Language-Action Models

Zhaoshu Yu, Bo Wang, Pengpeng Zeng, Haonan Zhang, Ji Zhang, Lianli Gao, Jingkuan Song, Nicu Sebe, Heng Tao Shen

2025-11-03

Summary

This paper is a comprehensive overview of research on making Vision-Language-Action models (AI systems that understand both images and language and can take actions in the real world) more practical and efficient to deploy.

What's the problem?

These advanced AI models require enormous amounts of computing power and data to work well, making them difficult and expensive to deploy in real-world settings like robotics. In short, they are too large and resource-hungry to be widely used.

What's the solution?

The paper organizes current research on this problem into three main areas. First, it covers ways to design the models themselves to be smaller and faster, including compressing existing models. Second, it examines techniques that make the training process, where the model learns from data, less computationally expensive. Third, it explores methods for collecting and using robot data more efficiently. The authors propose a unified taxonomy for these approaches and provide a detailed review of the best methods currently available.
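To make the "smaller and faster models" pillar concrete, here is a minimal, self-contained sketch (not from the survey) of symmetric int8 post-training weight quantization, one common model-compression technique that such surveys typically cover. All names below are illustrative assumptions, not the paper's API.

```python
# Illustrative sketch: quantize float weights to int8 with a single
# per-tensor scale, cutting storage from 4 bytes to 1 byte per weight.
import array

def quantize_int8(weights):
    """Map float weights to int8 values plus a dequantization scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    q = array.array('b', (round(w / scale) for w in weights))
    return q, scale

def dequantize(q, scale):
    """Recover approximate float weights from the int8 values."""
    return [w * scale for w in q]

weights = [0.5, -1.27, 0.03, 1.27]
q, scale = quantize_int8(weights)
approx = dequantize(q, scale)
# Each recovered weight is within half a quantization step of the original.
```

The trade-off this illustrates is exactly the one the survey's "efficient model design" pillar studies: a 4x smaller weight tensor in exchange for a bounded approximation error.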

Why it matters?

Making these models more efficient is crucial for bringing embodied AI – AI that can interact with the physical world – out of the lab and into everyday applications. This survey provides a valuable resource for researchers and developers working in this field, helping them understand the current state-of-the-art and identify future research directions.

Abstract

Vision-Language-Action models (VLAs) represent a significant frontier in embodied intelligence, aiming to bridge digital knowledge with physical-world interaction. While these models have demonstrated remarkable generalist capabilities, their deployment is severely hampered by the substantial computational and data requirements inherent to their underlying large-scale foundation models. Motivated by the urgent need to address these challenges, this survey presents the first comprehensive review of Efficient Vision-Language-Action models (Efficient VLAs) across the entire data-model-training process. Specifically, we introduce a unified taxonomy to systematically organize the disparate efforts in this domain, categorizing current techniques into three core pillars: (1) Efficient Model Design, focusing on efficient architectures and model compression; (2) Efficient Training, which reduces computational burdens during model learning; and (3) Efficient Data Collection, which addresses the bottlenecks in acquiring and utilizing robotic data. Through a critical review of state-of-the-art methods within this framework, this survey not only establishes a foundational reference for the community but also summarizes representative applications, delineates key challenges, and charts a roadmap for future research. We maintain a continuously updated project page to track our latest developments: https://evla-survey.github.io/