EfficientVLA: Training-Free Acceleration and Compression for Vision-Language-Action Models
Yantai Yang, Yuhao Wang, Zichen Wen, Zhongwei Luo, Chang Zou, Zhipeng Zhang, Chuan Wen, Linfeng Zhang
2025-06-18
Summary
This paper presents EfficientVLA, a training-free method that makes Vision-Language-Action (VLA) models faster and smaller by pruning redundant layers from the language component, keeping only the most informative visual tokens, and caching intermediate results in the model's action-generation head.
What's the problem?
The problem is that Vision-Language-Action models, which let robots connect what they see with language instructions and the actions they need to take, are slow and computationally expensive because they must process long sequences of visual and language information at every step.
What's the solution?
The researchers speed these models up without any retraining: they prune language-model layers that contribute little to the output, select only the most task-relevant visual tokens instead of processing every image patch, and cache intermediate features in the diffusion-based action head so that work is not repeated across denoising steps. Together these changes reduce both latency and compute.
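To make the visual-token idea concrete, here is a minimal sketch of importance-based token selection: rank visual tokens by a score (for example, how much attention they receive from the text tokens) and keep only the top fraction. This is an illustration of the general technique, not the paper's actual implementation; the function name and `keep_ratio` value are assumptions.

```python
import numpy as np

def select_visual_tokens(tokens, scores, keep_ratio=0.25):
    """Keep only the highest-scoring visual tokens.

    tokens: (N, D) array of visual token embeddings
    scores: (N,) importance scores, e.g. attention received from text tokens
    keep_ratio: fraction of tokens to retain (assumed value for illustration)
    """
    k = max(1, int(len(tokens) * keep_ratio))
    keep = np.argsort(scores)[-k:]  # indices of the k most important tokens
    keep.sort()                     # preserve the tokens' original spatial order
    return tokens[keep]

# Toy example: 16 visual tokens of dimension 4 with random importance scores.
rng = np.random.default_rng(0)
toks = rng.normal(size=(16, 4))
scores = rng.random(16)
pruned = select_visual_tokens(toks, scores, keep_ratio=0.25)
print(pruned.shape)  # (4, 4)
```

Because later layers only attend over the kept tokens, attention cost drops roughly quadratically with the fraction of tokens removed.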
Why it matters?
This matters because more efficient models let robots and AI systems react more quickly and use less energy, which is essential for real-time tasks such as driving, assisting humans, and operating in controlled environments.
Abstract
EfficientVLA accelerates Vision-Language-Action models by pruning language layers, optimizing visual token selection, and caching intermediate features in the diffusion-based action head.
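The caching idea in the diffusion-based action head can be sketched as follows: because intermediate features change slowly between consecutive denoising steps, an expensive layer can be recomputed only every few steps and its cached output reused in between. This is a generic illustration of step-wise feature caching under assumed names (`FeatureCache`, `interval`), not the paper's exact mechanism.

```python
class FeatureCache:
    """Reuse an expensive layer's output across adjacent denoising steps.

    Recompute the layer only every `interval` steps; in between, return
    the cached result. `interval` trades speed against fidelity.
    """
    def __init__(self, interval=2):
        self.interval = interval
        self.cache = None
        self.recomputes = 0

    def __call__(self, step, compute_fn):
        # Recompute on the first call and then every `interval`-th step.
        if self.cache is None or step % self.interval == 0:
            self.cache = compute_fn()
            self.recomputes += 1
        return self.cache

# Toy usage: stand in for a heavy layer with a cheap function of the step.
cache = FeatureCache(interval=2)
outs = [cache(t, lambda t=t: t * 10) for t in range(6)]
print(outs)              # [0, 0, 20, 20, 40, 40]
print(cache.recomputes)  # 3 (work done on steps 0, 2, 4 only)
```

With `interval=2`, the layer runs on half the denoising steps, roughly halving its cost at a small approximation error when features drift slowly.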