SwiftVLA: Unlocking Spatiotemporal Dynamics for Lightweight VLA Models at Minimal Overhead
Chaojun Ni, Cheng Chen, Xiaofeng Wang, Zheng Zhu, Wenzhao Zheng, Boyuan Wang, Tianrun Chen, Guosheng Zhao, Haoyun Li, Zhehao Dong, Qiang Zhang, Yun Ye, Yang Wang, Guan Huang, Wenjun Mei
2025-12-03
Summary
This paper introduces SwiftVLA, a new way to build Vision-Language-Action (VLA) models, which turn camera images and language instructions into robot actions. It focuses on making these models smaller and faster without sacrificing accuracy.
What's the problem?
Current VLA models are very large, which makes them impractical to run on devices like phones or robots. Smaller models are an option, but they often struggle to understand how things move and change over time. Adding 3D information helps, but it usually requires a large model to fuse the 2D and 3D inputs, and it still doesn't capture the timing of actions.
What's the solution?
SwiftVLA tackles this with a pretrained system that is good at understanding the shape and motion of objects over time (a 4D visual geometry transformer). It then uses 'Fusion Tokens', special learnable connectors that help the model combine information from regular 2D images with this 4D motion data. Finally, it trains the model to 'fill in the blanks': the 4D inputs are masked and the model learns to reconstruct them from the images alone, which teaches it to understand motion effectively. Importantly, the 4D branch can then be dropped at inference with little loss in performance, making the model even faster.
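The mask-and-reconstruct idea above can be illustrated with a toy sketch: zero out some of the 4D geometry tokens, predict them from the 2D image tokens, and score the prediction only at the masked positions. This is a minimal NumPy illustration under stated assumptions; all shapes, names (`mask_and_reconstruct`, `geo_tokens_4d`), and the fixed linear "reconstruction head" are hypothetical stand-ins, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two token streams (all names hypothetical):
# 2D image tokens from the vision encoder, and 4D geometry tokens
# from the 4D visual geometry transformer.
num_tokens, dim = 16, 32
image_tokens = rng.standard_normal((num_tokens, dim))
geo_tokens_4d = rng.standard_normal((num_tokens, dim))

def mask_and_reconstruct(image_tokens, geo_tokens_4d, mask_ratio=0.5):
    """Mask a fraction of the 4D tokens and score a simple linear
    reconstruction of them from the 2D image tokens (a toy stand-in
    for the model's learned reconstruction head)."""
    n = geo_tokens_4d.shape[0]
    num_masked = int(n * mask_ratio)
    masked_idx = rng.choice(n, size=num_masked, replace=False)

    # The model only sees image tokens plus zeroed-out 4D tokens
    # at the masked positions.
    visible_4d = geo_tokens_4d.copy()
    visible_4d[masked_idx] = 0.0

    # Hypothetical reconstruction head: a fixed random linear map here;
    # in training this would be optimized jointly with the policy.
    W = rng.standard_normal((dim, dim)) / np.sqrt(dim)
    predicted_4d = image_tokens @ W

    # The reconstruction loss is computed only on the masked positions.
    diff = predicted_4d[masked_idx] - geo_tokens_4d[masked_idx]
    loss = float(np.mean(diff ** 2))
    return visible_4d, masked_idx, loss

visible_4d, masked_idx, loss = mask_and_reconstruct(image_tokens, geo_tokens_4d)
print(loss >= 0.0)  # prints True: the loss is a non-negative MSE
```

Because the loss is defined only on masked positions, the model is pushed to infer 4D structure from 2D evidence, which is what lets the 4D branch be removed at inference.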
Why it matters?
This work matters because it makes capable VLA models efficient enough for real deployment. SwiftVLA matches the performance of models up to 7 times larger while running roughly 18 times faster and using about 12 times less memory, enough to run on resource-constrained edge devices. This opens the door to more real-world applications, such as robots and mobile devices.
Abstract
Vision-Language-Action (VLA) models built on pretrained Vision-Language Models (VLMs) show strong potential but are limited in practicality due to their large parameter counts. To mitigate this issue, using a lightweight VLM has been explored, but it compromises spatiotemporal reasoning. Although some methods suggest that incorporating additional 3D inputs can help, they usually rely on large VLMs to fuse 3D and 2D inputs and still lack temporal understanding. Therefore, we propose SwiftVLA, an architecture that enhances a compact model with 4D understanding while preserving design efficiency. Specifically, our approach features a pretrained 4D visual geometry transformer with a temporal cache that extracts 4D features from 2D images. Then, to enhance the VLM's ability to exploit both 2D images and 4D features, we introduce Fusion Tokens, a set of learnable tokens trained with a future prediction objective to generate unified representations for action generation. Finally, we introduce a mask-and-reconstruct strategy that masks 4D inputs to the VLM and trains the VLA to reconstruct them, enabling the VLM to learn effective 4D representations and allowing the 4D branch to be dropped at inference with minimal performance loss. Experiments in real and simulated environments show that SwiftVLA outperforms lightweight baselines and rivals VLAs up to 7 times larger, achieving comparable performance on edge devices while being 18 times faster and reducing memory footprint by 12 times.
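The Fusion Tokens described in the abstract can be pictured as a small set of learnable queries that cross-attend over the concatenated 2D image features and 4D geometry features to produce unified representations. The sketch below is a minimal single-head NumPy illustration; the shapes, names (`fuse`, `fusion_tokens`), and the bare dot-product attention are illustrative assumptions, not the paper's architecture (which trains these tokens with a future-prediction objective inside the VLM).

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

num_fusion, num_img, num_4d, dim = 4, 16, 16, 32

# Learnable Fusion Tokens (randomly initialized here; in the paper they
# are trained with a future-prediction objective).
fusion_tokens = rng.standard_normal((num_fusion, dim))

# 2D image features and 4D geometry features to be fused (toy values).
img_feats = rng.standard_normal((num_img, dim))
geo_feats = rng.standard_normal((num_4d, dim))

def fuse(fusion_tokens, img_feats, geo_feats):
    """Single-head cross-attention: Fusion Tokens query the concatenated
    2D + 4D features and return one unified vector per token."""
    context = np.concatenate([img_feats, geo_feats], axis=0)
    scores = fusion_tokens @ context.T / np.sqrt(dim)  # (num_fusion, num_img + num_4d)
    attn = softmax(scores, axis=-1)                    # rows sum to 1
    return attn @ context                              # (num_fusion, dim)

unified = fuse(fusion_tokens, img_feats, geo_feats)
print(unified.shape)  # prints (4, 32)
```

The key design point is that only these few fused tokens (rather than the full 2D + 4D token streams) need to feed the action head, which keeps the downstream computation small.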