QuantVLA: Scale-Calibrated Post-Training Quantization for Vision-Language-Action Models
Jingxuan Zhang, Yunta Hsieh, Zhongwei Wang, Haokun Lin, Xin Wang, Ziqi Wang, Yingtie Lei, Mi Zhang
2026-02-25
Summary
This paper introduces a new method called QuantVLA that makes large vision-language-action (VLA) models, which control robots or embodied agents based on visual observations and language instructions, more efficient without sacrificing performance.
What's the problem?
VLA models are getting very large and require a lot of computing power and memory, especially when they need to plan actions over longer horizons. This makes them difficult to deploy in real-world settings where resources are limited, such as on a robot with a small onboard computer or a battery-powered device. The 'action head' part of these models, which often uses a 'diffusion transformer' (DiT) architecture, is particularly hard to compress without hurting accuracy.
What's the solution?
The researchers developed QuantVLA, a post-training quantization technique that reduces the numerical precision of the model's weights and activations, replacing expensive floating-point arithmetic with cheap low-bit integer arithmetic. Because it works after the model is already trained, no extra training is needed. QuantVLA quantizes the linear layers in both the language backbone and the diffusion transformer, but deliberately keeps the attention projections in floating point, since they are the most sensitive to precision loss. Two scale-calibration tricks keep the quantized model accurate: attention temperature matching, which stabilizes attention scores with a per-head scaling factor, and output head balancing, which corrects energy drift at the layer outputs. The whole process is calibrated with only a small buffer of unlabeled example data.
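To make the general idea concrete, here is a minimal sketch of post-training weight quantization with an activation scale derived from a small calibration buffer, written in NumPy. The function names, the 8-bit setting, and the random stand-in tensors are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

def quantize_weights_per_channel(w, n_bits=8):
    """Symmetric per-output-channel quantization of a weight matrix."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=1, keepdims=True) / qmax   # one scale per row
    scale = np.where(scale == 0, 1.0, scale)              # guard all-zero rows
    w_int = np.clip(np.round(w / scale), -qmax - 1, qmax).astype(np.int8)
    return w_int, scale

def calibrate_activation_scale(calib_batches, n_bits=8):
    """Derive a single activation scale from a small unlabeled buffer."""
    qmax = 2 ** (n_bits - 1) - 1
    amax = max(np.abs(x).max() for x in calib_batches)
    return amax / qmax

# Stand-in weight matrix and calibration buffer (random, for illustration).
rng = np.random.default_rng(0)
w = rng.normal(size=(4, 8))
calib = [rng.normal(size=(2, 8)) for _ in range(4)]

w_int, w_scale = quantize_weights_per_channel(w)
a_scale = calibrate_activation_scale(calib)
w_hat = w_int.astype(np.float32) * w_scale  # dequantize to check fidelity
```

Because the scale is chosen per output channel, the worst-case reconstruction error of any weight is half a quantization step, which is why no retraining is needed for the weights themselves.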
Why it matters?
This work is important because it lets powerful VLA models run on less capable hardware, making them practical for real-world robots and agents. On the LIBERO benchmark, QuantVLA saves about 70% of the memory used by the quantized components and delivers a 1.22x end-to-end inference speedup, while matching or even exceeding the task success rates of the full-precision models. This opens the door to more scalable and accessible embodied intelligence: robots and agents that operate effectively under tight compute, memory, and power budgets.
Abstract
Vision-language-action (VLA) models unify perception, language, and control for embodied agents but face significant challenges in practical deployment due to rapidly increasing compute and memory demands, especially as models scale to longer horizons and larger backbones. To address these bottlenecks, we introduce QuantVLA, a training-free post-training quantization (PTQ) framework that, to our knowledge, is the first PTQ approach for VLA systems and the first to successfully quantize a diffusion transformer (DiT) action head. QuantVLA incorporates three scale-calibrated components: (1) a selective quantization layout that integerizes all linear layers in both the language backbone and the DiT while keeping attention projections in floating point to preserve the original operator schedule; (2) attention temperature matching, a lightweight per-head scaling mechanism that stabilizes attention logits and is folded into the dequantization scales at inference; and (3) output head balancing, a per-layer residual interface calibration that mitigates post-projection energy drift. The framework requires no additional training, uses only a small unlabeled calibration buffer, and supports integer kernels for low-bit weights and activations while leaving the architecture unchanged. Across representative VLA models on LIBERO, QuantVLA exceeds the task success rates of full-precision baselines, achieves about 70% relative memory savings on the quantized components, and delivers a 1.22x speedup in end-to-end inference latency, providing a practical pathway toward scalable low-bit embodied intelligence under strict compute, memory, and power constraints.
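As an illustration of how a per-head scaling can be "folded into the dequantization scales at inference" (component 2 of the abstract), here is a small NumPy sketch. All tensors and names are hypothetical stand-ins, not the paper's code; the point is only that multiplying the attention logits by a per-head temperature is mathematically identical to pre-multiplying the query dequantization scale by that temperature, so the stabilization adds no runtime cost.

```python
import numpy as np

rng = np.random.default_rng(1)
H, T, D = 2, 3, 4                                   # heads, tokens, head dim
q_int = rng.integers(-127, 128, size=(H, T, D))     # quantized query activations
k = rng.normal(size=(H, T, D))                      # keys (kept in float here)
q_scale = np.full((H, 1, 1), 0.02)                  # per-head dequant scale
temp = np.array([0.8, 1.25]).reshape(H, 1, 1)       # per-head temperature

# Unfused: dequantize the queries, then apply the temperature to the logits.
logits_a = temp * ((q_int * q_scale) @ k.transpose(0, 2, 1)) / np.sqrt(D)

# Fused: fold the per-head temperature into the dequantization scale.
logits_b = ((q_int * (q_scale * temp)) @ k.transpose(0, 2, 1)) / np.sqrt(D)

assert np.allclose(logits_a, logits_b)
```

The equivalence holds because the temperature is a per-head scalar, so it commutes with the matrix product; the fused form simply stores `q_scale * temp` as the new dequantization scale.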