RoboTracer: Mastering Spatial Trace with Reasoning in Vision-Language Models for Robotics
Enshen Zhou, Cheng Chi, Yibo Li, Jingkun An, Jiayuan Zhang, Shanyu Rong, Yi Han, Yuheng Ji, Mengzhen Liu, Pengwei Wang, Zhongyuan Wang, Lu Sheng, Shanghang Zhang
2025-12-17
Summary
This paper introduces a new system called RoboTracer that helps robots understand and follow complex spatial directions, like tracing a path across a room or a tabletop. The goal is to give robots the ability to 'see' and reason about 3D space so they can complete such tasks.
What's the problem?
Robots struggle with tasks that require them to understand directions involving distances and locations in a 3D environment. Existing methods aren't good at chaining multiple steps of spatial reasoning (for example, understanding 'go forward 2 meters, then turn left and go 1 meter') while also accurately measuring real-world distances. In short, they have trouble with tasks that require knowing both *where* to go and *how far* to go.
What's the solution?
The researchers developed RoboTracer, a system built on a VLM (Vision-Language Model) designed to understand 3D space. It is trained in two main stages: first, supervised fine-tuning teaches it to interpret spatial language and measure distances accurately; second, reinforcement learning lets it practice following complex multi-step directions, with rewards for getting the intermediate spatial steps right. To support this training and evaluation, they also built TraceSpatial, a large-scale dataset of 30 million question-answer pairs.
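The reinforcement-learning stage rewards the model for accurate intermediate steps, not just the final answer. As a minimal illustrative sketch (not the paper's actual implementation), a "metric-sensitive process reward" could score each predicted 3D waypoint in a trace by how close it is, in meters, to the reference waypoint; the function name and error scale below are hypothetical:

```python
import math

def metric_process_reward(pred_steps, gt_steps, tol=0.10):
    """Toy metric-sensitive process reward (illustrative only):
    each reasoning step predicts a 3D point, and the step earns
    credit that decays exponentially with its metric error.

    pred_steps / gt_steps: lists of (x, y, z) points in meters.
    tol: hypothetical error scale in meters.
    """
    if len(pred_steps) != len(gt_steps):
        return 0.0  # a malformed trace earns no process reward
    credits = []
    for p, g in zip(pred_steps, gt_steps):
        err = math.dist(p, g)            # Euclidean error in meters
        credits.append(math.exp(-err / tol))
    return sum(credits) / len(credits)   # mean step credit in [0, 1]
```

The idea is that supervising every step of the trace, rather than only the endpoint, gives the model a denser learning signal for multi-step spatial reasoning.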
Why does it matter?
This work matters because it significantly improves a robot's ability to act in the real world from natural-language instructions. RoboTracer outperforms strong existing systems, including Gemini-2.5-Pro, and can be paired with different control policies and robot platforms to perform complex tasks in cluttered, real-world environments. This brings us closer to robots that can genuinely understand and carry out our commands.
Abstract
Spatial tracing, as a fundamental embodied interaction ability for robots, is inherently challenging as it requires multi-step metric-grounded reasoning compounded with complex spatial referring and real-world metric measurement. However, existing methods struggle with this compositional task. To this end, we propose RoboTracer, a 3D-aware VLM that first achieves both 3D spatial referring and measuring via a universal spatial encoder and a regression-supervised decoder to enhance scale awareness during supervised fine-tuning (SFT). Moreover, RoboTracer advances multi-step metric-grounded reasoning via reinforcement fine-tuning (RFT) with metric-sensitive process rewards, supervising key intermediate perceptual cues to accurately generate spatial traces. To support SFT and RFT training, we introduce TraceSpatial, a large-scale dataset of 30M QA pairs, spanning outdoor/indoor/tabletop scenes and supporting complex reasoning processes (up to 9 steps). We further present TraceSpatial-Bench, a challenging benchmark filling the gap to evaluate spatial tracing. Experimental results show that RoboTracer surpasses baselines in spatial understanding, measuring, and referring, with an average success rate of 79.1%, and also achieves SOTA performance on TraceSpatial-Bench by a large margin, exceeding Gemini-2.5-Pro by 36% accuracy. Notably, RoboTracer can be integrated with various control policies to execute long-horizon, dynamic tasks across diverse robots (UR5, G1 humanoid) in cluttered real-world scenes.