MobilityBench: A Benchmark for Evaluating Route-Planning Agents in Real-World Mobility Scenarios
Zhiheng Song, Jingshuai Zhang, Chuan Qin, Chao Wang, Chao Chen, Longfei Xu, Kaikui Liu, Xiangxiang Chu, Hengshu Zhu
2026-02-27
Summary
This paper introduces a new way to test computer programs that help people find the best route to get somewhere, like the ones behind navigation apps. These programs are built on powerful AI systems called large language models, which are good at understanding and responding to natural language but haven't been thoroughly tested in real-world situations.
What's the problem?
Testing these AI route planners is really hard because people ask for routes in many different ways, map services aren't always consistent, and it's difficult to get the same results every time you run a test. It's tough to know if a bad route is because of the AI or just a glitch in the map data or something else changing during the test.
What's the solution?
The researchers created something called MobilityBench. This is a large collection of real route requests from users of a map app, covering many cities. They also built a special testing environment that replays these requests in a controlled way, so everything is predictable and reproducible. They then used this setup to evaluate how well different AI route planners perform, looking at things like whether the routes are valid, if the AI understands the requests, and how efficiently it finds solutions.
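The core idea behind the controlled testing environment is API replay: instead of calling a live map service (whose answers change over time), every tool call is answered from a fixed set of pre-recorded responses, so repeated runs see identical results. The sketch below is a minimal illustration of that idea under our own assumptions; the class and method names (`ReplaySandbox`, `record`, `call`) are hypothetical and not taken from the paper's toolkit.

```python
import hashlib
import json


class ReplaySandbox:
    """Illustrative deterministic API-replay sandbox (not the authors' code).

    Each tool call is keyed by its canonicalized JSON form and answered from
    a pre-recorded cache, so evaluation runs are fully reproducible.
    """

    def __init__(self, recorded_responses=None):
        # Maps request keys -> canned API responses.
        self._cache = dict(recorded_responses or {})

    @staticmethod
    def _key(endpoint, params):
        # Canonicalize the request (sorted keys) so the same logical call
        # always produces the same cache key.
        payload = json.dumps({"endpoint": endpoint, "params": params},
                             sort_keys=True)
        return hashlib.sha256(payload.encode()).hexdigest()

    def record(self, endpoint, params, response):
        self._cache[self._key(endpoint, params)] = response

    def call(self, endpoint, params):
        key = self._key(endpoint, params)
        if key not in self._cache:
            # Refuse any live fallback: a cache miss would reintroduce
            # the nondeterminism the sandbox exists to eliminate.
            raise KeyError(f"Unrecorded call: {endpoint}")
        return self._cache[key]


# Usage: record one routing response, then replay it deterministically.
sandbox = ReplaySandbox()
sandbox.record("route", {"origin": "A", "dest": "B"}, {"distance_km": 4.2})
print(sandbox.call("route", {"origin": "A", "dest": "B"}))
```

Because the cache key is computed from the sorted JSON form of the request, the same logical call hits the same recorded response regardless of parameter ordering, and any unrecorded call fails loudly rather than silently falling back to a live service.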
Why does it matter?
This work is important because it provides a standard way to measure and improve AI route planners. The tests showed that current AI models are pretty good at simple route requests, but struggle when people have specific preferences, like avoiding highways or wanting scenic routes. This highlights areas where the AI needs to get better to truly personalize navigation and make it more useful for everyone.
Abstract
Route-planning agents powered by large language models (LLMs) have emerged as a promising paradigm for supporting everyday human mobility through natural language interaction and tool-mediated decision making. However, systematic evaluation in real-world mobility settings is hindered by diverse routing demands, non-deterministic mapping services, and limited reproducibility. In this study, we introduce MobilityBench, a scalable benchmark for evaluating LLM-based route-planning agents in real-world mobility scenarios. MobilityBench is constructed from large-scale, anonymized real user queries collected from Amap and covers a broad spectrum of route-planning intents across multiple cities worldwide. To enable reproducible, end-to-end evaluation, we design a deterministic API-replay sandbox that eliminates environmental variance from live services. We further propose a multi-dimensional evaluation protocol centered on outcome validity, complemented by assessments of instruction understanding, planning, tool use, and efficiency. Using MobilityBench, we evaluate multiple LLM-based route-planning agents across diverse real-world mobility scenarios and provide an in-depth analysis of their behaviors and performance. Our findings reveal that current models perform competently on Basic Information Retrieval and Route Planning tasks, yet struggle considerably with Preference-Constrained Route Planning, underscoring significant room for improvement in personalized mobility applications. We publicly release the benchmark data, evaluation toolkit, and documentation at https://github.com/AMAP-ML/MobilityBench.