TraceVLA: Visual Trace Prompting Enhances Spatial-Temporal Awareness for Generalist Robotic Policies
Ruijie Zheng, Yongyuan Liang, Shuaiyi Huang, Jianfeng Gao, Hal Daumé III, Andrey Kolobov, Furong Huang, Jianwei Yang
2024-12-16
Summary
This paper introduces TraceVLA, a method that improves robots' awareness of space and time so they can better understand how their actions affect the environment. By adding visual cues that show recent motion directly on camera images, the method helps robots perform complex manipulation tasks more effectively.
What's the problem?
Large vision-language-action (VLA) models, which help robots learn from visual data, struggle with understanding how actions change over time and space. This limitation makes them less effective at performing intricate tasks like manipulating objects.
What's the solution?
The researchers introduced visual trace prompting, which improves the VLA model's action prediction by overlaying the robot's recent state-action trajectories directly onto the input image, giving the model a visual record of how the scene has been changing. They built the TraceVLA model by fine-tuning an existing model, OpenVLA, on a dataset of 150,000 robot manipulation trajectories annotated with these visual traces. The resulting model showed significant improvements across a wide range of simulated and real-robot tasks.
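To make the idea concrete, here is a minimal sketch of the general "visual trace" concept: drawing a short history of tracked 2D points onto an RGB frame before it is fed to the model. This is an illustration only; the function name, colors, and drawing details are assumptions, not the paper's actual implementation (which tracks points with a dedicated point-tracking model and uses its own rendering pipeline).

```python
def overlay_visual_trace(frame, trace, color=(255, 0, 0)):
    """Overlay a 2D trace onto an RGB frame.

    frame: nested list, frame[y][x] = (r, g, b)
    trace: list of (x, y) pixel coordinates, oldest first
    Returns a new annotated frame; the input frame is left untouched.
    """
    h, w = len(frame), len(frame[0])
    out = [row[:] for row in frame]  # shallow-copy each row
    # Connect consecutive trace points with interpolated line segments,
    # so the model sees the motion history as a continuous path.
    for (x0, y0), (x1, y1) in zip(trace, trace[1:]):
        steps = max(abs(x1 - x0), abs(y1 - y0), 1)
        for t in range(steps + 1):
            x = round(x0 + (x1 - x0) * t / steps)
            y = round(y0 + (y1 - y0) * t / steps)
            if 0 <= x < w and 0 <= y < h:
                out[y][x] = color
    return out

# Example: a short diagonal trace on a blank 224x224 frame
frame = [[(0, 0, 0)] * 224 for _ in range(224)]
trace = [(20, 20), (60, 50), (100, 90), (140, 140)]
annotated = overlay_visual_trace(frame, trace)
```

The key design point is that the trajectory is encoded *in pixel space* rather than as extra text or numeric tokens, so a pretrained vision-language backbone can consume it without any architectural changes.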
Why it matters?
The TraceVLA framework matters because it makes generalist robot policies substantially more capable: it achieves up to 3.5 times higher success rates than the OpenVLA baseline on real-robot tasks, along with a 10% gain in simulation. This advancement could lead to more capable robots that handle a wider range of tasks in everyday applications.
Abstract
Although large vision-language-action (VLA) models pretrained on extensive robot datasets offer promising generalist policies for robotic learning, they still struggle with spatial-temporal dynamics in interactive robotics, making them less effective in handling complex tasks, such as manipulation. In this work, we introduce visual trace prompting, a simple yet effective approach to facilitate VLA models' spatial-temporal awareness for action prediction by encoding state-action trajectories visually. We develop a new TraceVLA model by finetuning OpenVLA on our own collected dataset of 150K robot manipulation trajectories using visual trace prompting. Evaluations of TraceVLA across 137 configurations in SimplerEnv and 4 tasks on a physical WidowX robot demonstrate state-of-the-art performance, outperforming OpenVLA by 10% on SimplerEnv and 3.5x on real-robot tasks and exhibiting robust generalization across diverse embodiments and scenarios. To further validate the effectiveness and generality of our method, we present a compact VLA model based on 4B Phi-3-Vision, pretrained on Open-X-Embodiment and finetuned on our dataset, which rivals the 7B OpenVLA baseline while significantly improving inference efficiency.