VLASH: Real-Time VLAs via Future-State-Aware Asynchronous Inference
Jiaming Tang, Yufei Sun, Yilong Zhao, Shang Yang, Yujun Lin, Zhuoyang Zhang, James Hou, Yao Lu, Zhijian Liu, Song Han
2025-12-02
Summary
This paper introduces a new method, VLASH, to make robots that use both vision and language commands react more quickly and smoothly in the real world.
What's the problem?
Robots driven by vision and language instructions are often slow to respond because they must finish processing the current observation *before* they can act on it. Demonstration videos are often sped up to hide this, which doesn't reflect real-time conditions, and the robots react late when the environment changes unexpectedly. Existing attempts to fix this usually sacrifice accuracy or slow the robot down even further.
What's the solution?
VLASH lets the robot compute its next actions *while* it is still executing the current ones. The catch with this kind of asynchronous inference is that the world keeps moving during the computation, so the model's input is stale by the time its output is used. VLASH closes that gap by predicting where the robot will be once the actions it is currently executing finish, and feeding that estimated future state into the model instead. Essentially, the robot thinks one step ahead to stay synchronized with the real world, without changing the model's architecture or adding extra processing steps.
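To make the idea concrete, here is a minimal sketch of this future-state-aware asynchronous loop. It is an illustration of the scheme described above, not the actual VLASH implementation: the state is a single number, `predict_chunk` stands in for the slow VLA forward pass, and all function names are hypothetical.

```python
# Hypothetical sketch of future-state-aware asynchronous inference.
# All names (predict_chunk, apply_action, rollout) are illustrative.
from concurrent.futures import ThreadPoolExecutor

CHUNK = 4  # actions per predicted chunk (assumed)

def predict_chunk(state):
    # Stand-in for the (slow) VLA forward pass: plans CHUNK unit steps.
    return [1.0] * CHUNK

def apply_action(state, action):
    # Stand-in for executing one action on the robot.
    return state + action

def rollout(state, actions):
    # Core idea: roll the robot state forward through the actions that
    # will execute while inference runs, so the next prediction is
    # conditioned on the *future* (execution-time) state, not the stale one.
    for a in actions:
        state = apply_action(state, a)
    return state

def run(steps=3):
    state = 0.0
    chunk = predict_chunk(state)  # first chunk, computed synchronously
    with ThreadPoolExecutor(max_workers=1) as pool:
        for _ in range(steps):
            # Estimate where the robot will be after finishing the
            # current chunk, and launch inference from that state ...
            future_state = rollout(state, chunk)
            next_chunk = pool.submit(predict_chunk, future_state)
            # ... while the current chunk executes in real time.
            for a in chunk:
                state = apply_action(state, a)
            chunk = next_chunk.result()
    return state

print(run())  # 3 chunks of 4 unit actions -> 12.0
```

Because the rollout reuses the already-generated action chunk, it adds essentially no runtime cost, which matches the paper's claim of no additional overhead.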
Why it matters?
This matters because it lets these models handle tasks that demand fast reactions and precise movements, like playing ping-pong or whack-a-mole, which were previously out of reach. It makes robots more responsive and capable in dynamic, real-world environments, bringing us closer to robots that can genuinely assist in everyday life.
Abstract
Vision-Language-Action models (VLAs) are becoming increasingly capable across diverse robotic tasks. However, their real-world deployment remains slow and inefficient: demonstration videos are often sped up by 5-10x to appear smooth, with noticeable action stalls and delayed reactions to environmental changes. Asynchronous inference offers a promising solution to achieve continuous and low-latency control by enabling robots to execute actions and perform inference simultaneously. However, because the robot and environment continue to evolve during inference, a temporal misalignment arises between the prediction and execution intervals. This leads to significant action instability, while existing methods either degrade accuracy or introduce runtime overhead to mitigate it. We propose VLASH, a general asynchronous inference framework for VLAs that delivers smooth, accurate, and fast reaction control without additional overhead or architectural changes. VLASH estimates the future execution-time state by rolling the robot state forward with the previously generated action chunk, thereby bridging the gap between prediction and execution. Experiments show that VLASH achieves up to 2.03x speedup and reduces reaction latency by up to 17.4x compared to synchronous inference while fully preserving the original accuracy. Moreover, it empowers VLAs to handle fast-reaction, high-precision tasks such as playing ping-pong and playing whack-a-mole, where traditional synchronous inference fails. Code is available at https://github.com/mit-han-lab/vlash