VLA-4D: Embedding 4D Awareness into Vision-Language-Action Models for SpatioTemporally Coherent Robotic Manipulation

Hanyu Zhou, Chuanhao Ma, Gim Hee Lee

2025-11-24

Summary

This paper introduces a new model called VLA-4D designed to help robots perform complex tasks by understanding both what they're seeing and what actions to take over time.

What's the problem?

Current robots that use vision and language to follow instructions often struggle with tasks requiring precise movements and timing, like smoothly manipulating objects. They can 'see' things and 'know' what to do, but coordinating those actions in a fluid, realistic way is difficult. Existing methods try to give robots a better sense of 3D space, but they often fail to keep action execution consistent over time.

What's the solution?

The researchers created VLA-4D, which gives the robot a stronger understanding of both space *and* time. They do this by combining visual information with a sense of when and where actions happen. Specifically, they added time as another dimension to the robot's understanding of object positions, and then they improved how the robot plans and executes actions by considering both spatial location and timing. They also created a more detailed dataset to help train the model.

Why it matters?

This work is important because it moves robots closer to being able to perform more complex and realistic tasks in the real world. By improving a robot’s ability to understand and coordinate actions over time, it opens the door for robots to help with things like assembly, cooking, or any task requiring delicate and precise movements.

Abstract

Vision-language-action (VLA) models show potential for general robotic tasks, but still struggle with spatiotemporally coherent manipulation, which requires fine-grained representations. Typically, existing methods embed 3D positions into visual representations to enhance the spatial precision of actions. However, these methods struggle to achieve temporally coherent control over action execution. In this work, we propose VLA-4D, a general VLA model with 4D awareness for spatiotemporally coherent robotic manipulation. Our model is guided by two key designs: 1) 4D-aware visual representation. We extract visual features, embed 1D time into 3D positions for 4D embeddings, and fuse them into a unified visual representation via a cross-attention mechanism. 2) Spatiotemporal action representation. We extend conventional spatial action representations with temporal information to enable spatiotemporal planning, and align the multimodal representations into the LLM for spatiotemporal action prediction. Within this unified framework, the designed visual and action representations jointly make robotic manipulation spatially smooth and temporally coherent. In addition, we extend the VLA dataset with temporal action annotations for fine-tuning our model. Extensive experiments verify the superiority of our method across different robotic manipulation tasks.
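To make the first design concrete, here is a minimal NumPy sketch of the general idea behind a 4D-aware visual representation: a scalar timestep is appended to 3D point positions, the resulting 4D coordinates are turned into embeddings, and visual features attend to them via cross-attention. All function names, dimensions, and the sinusoidal embedding choice are illustrative assumptions, not the paper's actual implementation.

```python
import numpy as np

# Illustrative sketch only: names, dimensions, and the sinusoidal
# embedding are assumptions, not VLA-4D's actual architecture.

def sinusoidal_embed(x, dim):
    """Map scalar coordinates to a sinusoidal embedding of size `dim`."""
    freqs = 1.0 / (10000 ** (np.arange(dim // 2) / (dim // 2)))
    angles = x[..., None] * freqs                       # (..., dim//2)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)

def embed_4d(positions_3d, t, dim=32):
    """Embed 1D time into 3D positions to form one 4D embedding per point.
    positions_3d: (N, 3) point coordinates; t: scalar timestep."""
    time_col = np.full((positions_3d.shape[0], 1), t)
    pos4d = np.concatenate([positions_3d, time_col], axis=-1)  # (N, 4)
    # Embed each of the 4 coordinates, then sum into one vector per point.
    return sinusoidal_embed(pos4d, dim).sum(axis=-2)           # (N, dim)

def cross_attention(queries, keys_values):
    """Single-head cross-attention: visual features attend to 4D embeddings."""
    d = queries.shape[-1]
    scores = queries @ keys_values.T / np.sqrt(d)              # (Nq, Nk)
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values                               # (Nq, dim)

# Toy usage: 5 visual tokens fuse with 4D embeddings of 5 points at t=3.
rng = np.random.default_rng(0)
visual_feats = rng.normal(size=(5, 32))
points = rng.normal(size=(5, 3))
fused = visual_feats + cross_attention(visual_feats, embed_4d(points, t=3.0))
print(fused.shape)  # (5, 32)
```

The key point the sketch captures is that time is treated as a fourth coordinate alongside x, y, z before fusion, so the resulting visual representation carries both spatial and temporal cues into downstream action prediction.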