Mem4Nav: Boosting Vision-and-Language Navigation in Urban Environments with a Hierarchical Spatial-Cognition Long-Short Memory System
Lixuan He, Haoyu Dong, Zhenxing Chen, Yangcheng Yu, Jie Feng, Yong Li
2025-06-25
Summary
This paper introduces Mem4Nav, a system that helps robots and AI agents navigate large, complex cities by combining vision, language understanding, and a hierarchical memory system that retains important spatial details over time.
What's the problem?
Navigating large urban areas is hard for AI: an agent must understand instructions, keep track of where it is, remember landmarks, and avoid obstacles, all in real time. Most existing methods either fail to retain long-term information or struggle with the sheer complexity of city environments.
What's the solution?
The researchers created a hierarchical spatial-cognition system that stores detailed maps at multiple scales and links landmarks together in a graph. On top of this, they use a dual-memory design: long-term memory retains historical information about places the agent has visited, while short-term memory tracks the current surroundings. Both memories are integrated through a reversible Transformer, which lets the AI quickly store and retrieve past and present observations, improving task completion and decision-making speed.
Why it matters?
This matters because it makes AI agents much better at combining language and vision to move through real-world cities efficiently. That capability is useful for delivery robots, self-driving cars, and other applications that require reliable navigation in busy urban environments.
Abstract
Mem4Nav enhances Vision-and-Language Navigation by integrating a hierarchical spatial-cognition system with dual-memory modules using a reversible Transformer for improved task completion, speed, and detour detection.