NaviTrace: Evaluating Embodied Navigation of Vision-Language Models
Tim Windecker, Manthan Patel, Moritz Reuss, Richard Schwarzkopf, Cesar Cadena, Rudolf Lioutikov, Marco Hutter, Jonas Frey
2025-11-04
Summary
This paper introduces a new way to test how well vision-language models can control robots for navigation: the model is given an instruction and must 'draw' a good path to follow in an image.
What's the problem?
Currently, testing these models for robot navigation is difficult: real-world trials are expensive and time-consuming, simulations are often too simple to be realistic, and existing benchmarks aren't comprehensive enough. A better, more reliable way is needed to measure how well these models understand instructions and can plan a route for a robot.
What's the solution?
The researchers created a benchmark called NaviTrace. It presents a model with a navigation instruction and the type of robot it's controlling (a human, a legged robot, a wheeled robot, or a bicycle) and asks it to draw a path on an image. They collected more than 3000 expert-drawn paths across 1000 scenarios to use as a standard for comparison. They also developed a scoring system that considers how closely the model's path matches the human paths, how accurately it reaches the goal, and whether the path makes sense for the specific robot type.
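The scoring idea, combining a path-shape distance (Dynamic Time Warping), a goal endpoint error, and embodiment-conditioned semantic penalties, can be sketched roughly as follows. This is a minimal illustration under assumed definitions, not the paper's actual implementation: the weighting scheme, the `dtw_distance` helper, and the per-pixel semantic-penalty lookup are all assumptions.

```python
import math

def dtw_distance(trace_a, trace_b):
    # Classic dynamic-programming DTW between two 2D point sequences:
    # aligns the traces elastically and sums Euclidean point distances.
    n, m = len(trace_a), len(trace_b)
    inf = float("inf")
    d = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = math.dist(trace_a[i - 1], trace_b[j - 1])
            d[i][j] = cost + min(d[i - 1][j], d[i][j - 1], d[i - 1][j - 1])
    return d[n][m]

def trace_score(pred, expert, semantics, penalty_weights,
                alpha=1.0, beta=1.0, gamma=1.0):
    """Hypothetical combined trace score (lower is better).

    pred, expert      -- lists of (x, y) image-space points
    semantics         -- per-pixel class labels, indexed as semantics[y][x]
    penalty_weights   -- embodiment-specific cost per semantic class,
                         e.g. {water_class: 1.0} for a wheeled robot
    """
    # 1) Shape term: DTW distance, normalized by trace length.
    shape_term = dtw_distance(pred, expert) / max(len(pred), len(expert))
    # 2) Goal term: error between predicted and expert endpoints.
    goal_term = math.dist(pred[-1], expert[-1])
    # 3) Embodiment penalty: accumulate costs of the semantic classes
    #    the predicted path crosses (classes absent from the dict are free).
    penalty = sum(penalty_weights.get(semantics[int(y)][int(x)], 0.0)
                  for x, y in pred)
    return alpha * shape_term + beta * goal_term + gamma * penalty
```

In this sketch a path identical to an expert trace that avoids all penalized terrain scores 0; deviating in shape, missing the goal, or crossing terrain the given embodiment cannot traverse each add cost.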
Why it matters?
This work is important because it provides a standardized and scalable way to evaluate vision-language models for robotic navigation. By identifying weaknesses in spatial understanding and goal localization, it helps researchers improve these models and move closer to building robots that can navigate the real world effectively based on natural language instructions.
Abstract
Vision-language models demonstrate unprecedented performance and generalization across a wide range of tasks and scenarios. Integrating these foundation models into robotic navigation systems opens pathways toward building general-purpose robots. Yet, evaluating these models' navigation capabilities remains constrained by costly real-world trials, overly simplified simulations, and limited benchmarks. We introduce NaviTrace, a high-quality Visual Question Answering benchmark where a model receives an instruction and embodiment type (human, legged robot, wheeled robot, bicycle) and must output a 2D navigation trace in image space. Across 1000 scenarios and more than 3000 expert traces, we systematically evaluate eight state-of-the-art VLMs using a newly introduced semantic-aware trace score. This metric combines Dynamic Time Warping distance, goal endpoint error, and embodiment-conditioned penalties derived from per-pixel semantics, and correlates with human preferences. Our evaluation reveals a consistent gap to human performance caused by poor spatial grounding and goal localization. NaviTrace establishes a scalable and reproducible benchmark for real-world robotic navigation. The benchmark and leaderboard can be found at https://leggedrobotics.github.io/navitrace_webpage/.