VLingNav: Embodied Navigation with Adaptive Reasoning and Visual-Assisted Linguistic Memory
Shaoan Wang, Yuanfei Luo, Xingyu Chen, Aocheng Luo, Dongyue Li, Chang Liu, Sheng Chen, Yangang Zhang, Junzhi Yu
2026-01-14
Summary
This paper introduces VLingNav, a new AI model designed to help robots navigate complex environments. It builds on existing AI that combines vision and language, but adds more sophisticated 'thinking' and memory capabilities.
What's the problem?
Current AI models for robot navigation often react directly to what they 'see' without really planning or remembering where they've been. This makes them struggle with tasks that require long-term planning, like finding a specific object across a large space or navigating changing environments. They lack the ability to reason through steps and remember past experiences to avoid getting stuck or repeating mistakes.
What's the solution?
VLingNav tackles this by mimicking how humans think. It uses a system that decides *when* to actively think through a problem (like planning a route) and *when* to just act instinctively. It also creates a 'memory' that combines visual information with language, allowing the robot to recall past observations and understand spatial relationships over longer distances. The researchers also created a large dataset specifically for training this kind of reasoning ability in robots, and used a training method that combines learning from examples with learning through trial and error.
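The "when to think" decision described above can be pictured as a dual-process gate: act directly when the policy is confident, and trigger explicit reasoning only when it is uncertain. The sketch below is a minimal illustration of that idea, not the authors' implementation; the entropy-based trigger, the threshold value, and all function names are assumptions.

```python
import math

def action_entropy(probs):
    """Shannon entropy (in nats) of the policy's action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def navigate_step(action_probs, entropy_threshold=1.0):
    """Dual-process gate (illustrative): fast path when confident,
    slow deliberate path when the action distribution is high-entropy.
    Returns ("act", action_index) or ("reason", None)."""
    if action_entropy(action_probs) < entropy_threshold:
        # System-1: intuitive execution, no reasoning tokens generated
        best = max(range(len(action_probs)), key=action_probs.__getitem__)
        return ("act", best)
    # System-2: hand control to explicit chain-of-thought planning
    # (placeholder for generating reasoning tokens before acting)
    return ("reason", None)
```

For example, a peaked distribution like `[0.9, 0.05, 0.05]` falls below the threshold and the agent acts immediately, while a near-uniform distribution over four actions (entropy ≈ 1.39 nats) triggers deliberate reasoning.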
Why it matters?
This research is important because it represents a significant step towards robots that can navigate the real world more effectively and independently. VLingNav’s ability to reason and remember allows it to perform better on challenging navigation tasks and even transfer its skills to real-world robots without needing further training, opening the door for more practical and adaptable robotic systems.
Abstract
Vision-language-action (VLA) models have shown promising potential in embodied navigation by unifying perception and planning while inheriting the strong generalization abilities of large vision-language models (VLMs). However, most existing VLA models rely on reactive mappings directly from observations to actions, lacking the explicit reasoning capabilities and persistent memory required for complex, long-horizon navigation tasks. To address these challenges, we propose VLingNav, a VLA model for embodied navigation grounded in linguistic-driven cognition. First, inspired by the dual-process theory of human cognition, we introduce an adaptive chain-of-thought mechanism, which dynamically triggers explicit reasoning only when necessary, enabling the agent to fluidly switch between fast, intuitive execution and slow, deliberate planning. Second, to handle long-horizon spatial dependencies, we develop a visual-assisted linguistic memory module that constructs a persistent, cross-modal semantic memory, enabling the agent to recall past observations to prevent repetitive exploration and infer movement trends in dynamic environments. For the training recipe, we construct Nav-AdaCoT-2.9M, the largest embodied navigation dataset with reasoning annotations to date, enriched with adaptive CoT annotations that induce a reasoning paradigm capable of adjusting both when to think and what to think about. Moreover, we incorporate an online expert-guided reinforcement learning stage, enabling the model to surpass pure imitation learning and to acquire more robust, self-explored navigation behaviors. Extensive experiments demonstrate that VLingNav achieves state-of-the-art performance across a wide range of embodied navigation benchmarks. Notably, VLingNav transfers to real-world robotic platforms in a zero-shot manner, executing various navigation tasks and demonstrating strong cross-domain and cross-task generalization.
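The abstract's visual-assisted linguistic memory stores past observations in a cross-modal form that can later be recalled to avoid repetitive exploration. The sketch below shows one plausible shape for such a memory, assuming each observation has been summarized into a language caption paired with an embedding vector; the class, its methods, and the similarity threshold are illustrative assumptions, not the paper's API.

```python
def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(x * x for x in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

class LinguisticMemory:
    """Illustrative cross-modal memory: captions indexed by embeddings."""

    def __init__(self, sim_threshold=0.9):
        self.entries = []  # list of (step, caption, embedding)
        self.sim_threshold = sim_threshold

    def write(self, step, caption, embedding):
        """Record a language summary of the current observation."""
        self.entries.append((step, caption, embedding))

    def seen_before(self, embedding):
        """True if a past observation closely matches, so the agent
        can skip re-exploring an already-visited region."""
        return any(cosine(e, embedding) >= self.sim_threshold
                   for _, _, e in self.entries)

    def recall(self, query_embedding, k=3):
        """Return the k most similar past captions, e.g. to condition
        the planner's next reasoning step on relevant history."""
        ranked = sorted(self.entries,
                        key=lambda ent: cosine(ent[2], query_embedding),
                        reverse=True)
        return [caption for _, caption, _ in ranked[:k]]
```

Because the recalled items are language captions rather than raw frames, they can be fed directly back into a VLM's context window, which is one natural reading of "linguistic" memory here.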