Ground Slow, Move Fast: A Dual-System Foundation Model for Generalizable Vision-and-Language Navigation

Meng Wei, Chenyang Wan, Jiaqi Peng, Xiqian Yu, Yuqiang Yang, Delin Feng, Wenzhe Cai, Chenming Zhu, Tai Wang, Jiangmiao Pang, Xihui Liu

2025-12-10

Summary

This paper introduces a new approach to helping robots navigate using both visual information and natural language instructions, called DualVLN.

What's the problem?

Current robots that navigate from vision and language often produce jerky, fragmented movements and react slowly to changes in their environment, such as people or objects moving around. They typically map instructions directly to short-horizon actions without much planning, which makes them struggle in realistic, dynamic situations.

What's the solution?

The researchers created a system with two parts. The first part, 'System 2', acts as a high-level planner: a vision-language model that slowly "grounds" the instruction in what the robot sees, predicting mid-term waypoint goals directly in the image. The second part, 'System 1', is a faster, lightweight controller that actually moves the robot, using both the pixel-level goal and latent features from the planner to generate smooth and accurate trajectories. Because the two parts are trained and run separately, the planner keeps the broad generalization of the underlying vision-language model, while the controller can adapt to changing conditions in real time.
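The two-part design above can be pictured as two loops running at different rates: a slow planning loop that occasionally re-grounds the instruction to a waypoint goal, and a fast control loop that tracks that goal every tick. The sketch below is purely illustrative; all class names, numbers, and the toy trajectory math are assumptions, not the authors' implementation.

```python
# Illustrative sketch of a dual-system navigation loop in the spirit of
# DualVLN. Everything here (names, frequencies, toy outputs) is a
# hypothetical stand-in, not the paper's actual model.

class SlowPlanner:
    """Stands in for System 2: a VLM-style planner that 'grounds slowly',
    emitting a mid-term pixel waypoint goal plus latent features."""

    def plan(self, image, instruction):
        # Pretend the planner grounded the instruction to a pixel goal.
        goal_px = (32, 48)            # (u, v) waypoint in the image
        latent = [0.1, -0.2, 0.3]     # latent conditioning features
        return goal_px, latent


class FastController:
    """Stands in for System 1: a lightweight policy that 'moves fast',
    turning the goal and latents into a short, smooth trajectory."""

    def act(self, observation, goal_px, latent):
        # Toy math: interpolate three waypoints toward the pixel goal.
        u, v = goal_px
        return [(u * t / 3, v * t / 3) for t in (1, 2, 3)]


def navigate(steps=6, replan_every=3):
    """Run the fast controller every tick, but re-run the slow planner
    only every `replan_every` ticks -- the dual-frequency idea."""
    planner, controller = SlowPlanner(), FastController()
    goal, latent = planner.plan(image=None, instruction="go to the door")
    trajectory = []
    for step in range(steps):
        if step % replan_every == 0:  # slow loop: refresh the mid-term goal
            goal, latent = planner.plan(image=None,
                                        instruction="go to the door")
        # fast loop: always act, even between planner updates
        trajectory.extend(
            controller.act(observation=None, goal_px=goal, latent=latent))
    return trajectory


path = navigate()
```

The key point the sketch captures is the decoupling: the controller never waits on the planner, so the robot keeps moving smoothly while the slower, more deliberate grounding step catches up in the background.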

Why it matters?

This work is important because it makes robots much better at navigating complex, real-world environments. It allows for more natural and efficient movement, and the robot can handle unexpected obstacles without getting stuck or confused. This is a step towards robots that can truly assist people in everyday tasks.

Abstract

While recent large vision-language models (VLMs) have improved generalization in vision-language navigation (VLN), existing methods typically rely on end-to-end pipelines that map vision-language inputs directly to short-horizon discrete actions. Such designs often produce fragmented motions, incur high latency, and struggle with real-world challenges like dynamic obstacle avoidance. We propose DualVLN, the first dual-system VLN foundation model that synergistically integrates high-level reasoning with low-level action execution. System 2, a VLM-based global planner, "grounds slowly" by predicting mid-term waypoint goals via image-grounded reasoning. System 1, a lightweight, multi-modal conditioning Diffusion Transformer policy, "moves fast" by leveraging both explicit pixel goals and latent features from System 2 to generate smooth and accurate trajectories. The dual-system design enables robust real-time control and adaptive local decision-making in complex, dynamic environments. By decoupling training, the VLM retains its generalization, while System 1 achieves interpretable and effective local navigation. DualVLN outperforms prior methods across all VLN benchmarks and real-world experiments demonstrate robust long-horizon planning and real-time adaptability in dynamic environments.