Steering Vision-Language-Action Models as Anti-Exploration: A Test-Time Scaling Approach
Siyuan Yang, Yang Zhang, Haoran He, Ling Pan, Xiu Li, Chenjia Bai, Xuelong Li
2025-12-04
Summary
This paper focuses on improving how reliably Vision-Language-Action (VLA) models perform tasks after being pre-trained on large, diverse datasets. These models are good at learning complex actions, but they can become unstable and unreliable when adapted to specific new tasks.
What's the problem?
VLA models learn from a huge variety of data, including examples of how humans perform actions. However, humans aren't always perfect! The data can contain unnecessary or even bad ways of doing things. When these models are fine-tuned for a specific task, these irrelevant action modes linger, and performance becomes inconsistent: the model may succeed on one run and fail on the next, even with the same instructions and starting conditions, simply because of the random noise it samples during generation.
What's the solution?
The researchers developed a test-time scaling (TTS) method called TACO. At each step, TACO samples several candidate action chunks from the model and uses a lightweight 'pseudo-count' estimator to check how consistent each one is with the stable, successful behavior in the data. It does this without changing the model's weights – it only affects how the model *uses* its knowledge when performing the task. By executing the candidate with the highest pseudo-count, TACO favors actions that look natural and consistent with successful examples, effectively filtering out the confusing, suboptimal actions the model might have learned from the messy initial data. It's like giving the model a quick 'sanity check' before it acts.
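The sample-then-verify loop described above can be sketched in a few lines. Everything here is a hypothetical stand-in: the "policy" just returns its noise input, and the pseudo-count is a simple kernel-density score against a buffer of successful action chunks, not the paper's actual estimator.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_action_chunk(noise):
    # Stand-in for a flow/diffusion VLA policy: in reality, each noise
    # sample would be denoised into a different candidate action chunk.
    return noise

def pseudo_count(chunk, success_chunks, bandwidth=1.0):
    # Kernel-density stand-in for the pseudo-count verifier: candidates
    # close to many successful chunks receive a higher score.
    dists = np.linalg.norm(success_chunks - chunk, axis=1)
    return np.exp(-(dists / bandwidth) ** 2).sum()

def select_action_chunk(success_chunks, n_samples=8, dim=4):
    # Best-of-N at test time: sample N candidate chunks, then execute
    # the one the verifier deems most in-distribution.
    candidates = [sample_action_chunk(rng.normal(size=dim))
                  for _ in range(n_samples)]
    scores = [pseudo_count(c, success_chunks) for c in candidates]
    return candidates[int(np.argmax(scores))]

# Toy "successful demonstration" chunks clustered near the origin.
success = rng.normal(scale=0.1, size=(32, 4))
chosen = select_action_chunk(success)
```

Because the selection is a gradient-free argmax over sampled candidates, it adds only forward passes at inference time and never touches the model's weights, which is the point of the test-time-scaling framing.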
Why it matters?
This work is important because it makes VLA models much more reliable and practical. By improving their stability, we can trust them to perform tasks consistently, which is crucial for real-world applications like robotics. Also, TACO is efficient because it doesn't require retraining the model, making it especially useful for complex VLA models that are difficult to update.
Abstract
Vision-Language-Action (VLA) models, trained via flow-matching or diffusion objectives, excel at learning complex behaviors from large-scale, multi-modal datasets (e.g., human teleoperation, scripted policies). However, since VLAs incorporate diverse data modes in the pre-training stage, and the finetuning dataset often contains demonstration data collected in a kinematically suboptimal or undesirable way, redundant action modes arise that are irrelevant to the successful action modes of the downstream task. Specifically, we observe a critical inference-time fragility across different sampled noises after supervised finetuning of pre-trained VLAs. In this paper, we attribute this instability to the distribution shift between the VLA policy and the policy induced by the stable success modes of the downstream-task dataset. Thus, we propose TACO, a test-time-scaling (TTS) framework that applies a lightweight pseudo-count estimator as a high-fidelity verifier of action chunks. A VLA model integrated with TACO executes the action chunk with the maximum pseudo-count among all sampled candidates, thereby preventing distribution shift while preserving the generalization ability of the VLA, since the constraint is applied only during inference. Our method resembles the classical anti-exploration principle in offline reinforcement learning (RL), and, being gradient-free, it offers significant computational benefits over RL updates, especially for flow- or diffusion-based VLAs, for which RL updates are difficult due to the iterative denoising process. Extensive experiments across four simulation benchmarks (RoboTwin2.0, RoboTwin, LIBERO, SimplerEnv) and a dual-arm platform demonstrate that our method significantly improves inference stability and success rates in downstream-task adaptation.
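The anti-exploration principle the abstract references penalizes actions that stray from the offline dataset. A minimal sketch of that idea, using a hypothetical nearest-neighbor novelty measure (not the paper's estimator): offline RL subtracts the novelty term when training a critic, whereas TACO uses only the gradient-free count-style term to rank sampled action chunks at inference.

```python
import numpy as np

def novelty(action, dataset_actions):
    # Hypothetical novelty measure: distance to the nearest action
    # seen in the offline dataset (low = in-distribution).
    return float(np.min(np.linalg.norm(dataset_actions - action, axis=1)))

def anti_exploration_score(reward, action, dataset_actions, alpha=1.0):
    # Classic anti-exploration: penalize out-of-distribution actions so
    # the learned policy stays close to the data it was trained on.
    return reward - alpha * novelty(action, dataset_actions)

# Two dataset actions; an in-distribution query outscores a far-away one
# even when both would receive the same environment reward.
data = np.array([[0.0, 0.0], [1.0, 0.0]])
in_dist = anti_exploration_score(1.0, np.array([0.1, 0.0]), data)
ood = anti_exploration_score(1.0, np.array([5.0, 5.0]), data)
```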