EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models
Yantai Yang, Yuhao Wang, Zichen Wen, Zhongwei Luo, Chang Zou, Zhipeng Zhang, Chuan Wen, Linfeng Zhang
2025-06-18
Summary
This paper presents EfficientVLA, a training-free method that makes Vision-Language-Action (VLA) models faster and smaller by pruning redundant layers from the language component, keeping only the most informative visual tokens, and caching intermediate results in the model's action-generation head.
What's the problem?
The problem is that Vision-Language-Action models, which let robots connect what they see with language instructions and the actions they need to take, are slow and computationally expensive because they must process long sequences of visual and language information at every step.
What's the solution?
The researchers speed these models up without any retraining: they prune language-model layers that contribute little to the output, select only the most task-relevant visual tokens instead of processing every image patch, and cache intermediate features in the diffusion-based action head so that work is not repeated across denoising steps. Together these changes reduce both latency and compute.
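To make the visual-token idea concrete, here is a minimal sketch of importance-based token selection: rank visual tokens by a score (for example, how much attention they receive from the text tokens) and keep only the top fraction. This is an illustration of the general technique, not the paper's actual implementation; the function name and `keep_ratio` value are assumptions.

```python
import numpy as np

def select_visual_tokens(tokens, scores, keep_ratio=0.25):
    """Keep only the highest-scoring visual tokens.

    tokens: (N, D) array of visual token embeddings
    scores: (N,) importance scores, e.g. attention received from text tokens
    keep_ratio: fraction of tokens to retain (assumed value for illustration)
    """
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]  # indices of the k most important tokens
    keep.sort()                     # preserve the tokens' original spatial order
    return tokens[keep]

# Toy example: 16 visual tokens of dimension 4 with random importance scores.
rng = np.random.default_rng(0)
toks = rng.normal(size=(16, 4))
scores = rng.random(16)
pruned = select_visual_tokens(toks, scores, keep_ratio=0.25)
print(pruned.shape)  # (4, 4)
```

Because later layers only attend over the kept tokens, attention cost drops roughly quadratically with the fraction of tokens removed.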
Why it matters?
This matters because more efficient models let robots and AI systems react more quickly and use less energy, which is essential for real-time tasks such as driving, assisting humans, and operating in controlled environments.
Abstract
EfficientVLA accelerates Vision-Language-Action models by pruning language layers, optimizing visual token selection, and caching intermediate features in the diffusion-based action head.
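The caching idea in the diffusion-based action head can be sketched as follows: because intermediate features change slowly between consecutive denoising steps, an expensive layer can be recomputed only every few steps and its cached output reused in between. This is a generic illustration of step-wise feature caching under assumed names (`FeatureCache`, `interval`), not the paper's exact mechanism.

```python
class FeatureCache:
    """Reuse an expensive layer's output across adjacent denoising steps.

    Recompute the layer only every `interval` steps; in between, return
    the cached result. `interval` trades speed against fidelity.
    """
    def __init__(self, interval=2):
        self.interval = interval
        self.cache = None
        self.recomputes = 0

    def __call__(self, step, compute_fn):
        # Recompute on the first call and then every `interval`-th step.
        if self.cache is None or step % self.interval == 0:
            self.cache = compute_fn()
            self.recomputes += 1
        return self.cache

# Toy usage: stand in for a heavy layer with a cheap function of the step.
cache = FeatureCache(interval=2)
outs = [cache(t, lambda t=t: t * 10) for t in range(6)]
print(outs)              # [0, 0, 20, 20, 40, 40]
print(cache.recomputes)  # 3 (work done on steps 0, 2, 4 only)
```

With `interval=2`, the layer runs on half the denoising steps, roughly halving its cost at a small approximation error when features drift slowly.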