MG-Nav: Dual-Scale Visual Navigation via Sparse Spatial Memory

Bo Wang, Jiehong Lin, Chenzhi Liu, Xinting Hu, Yifei Yu, Tianjia Liu, Zhongrui Wang, Xiaojuan Qi

2025-12-03

Summary

This paper introduces a new system called MG-Nav that helps robots navigate through real-world environments using only visual information, even in places they've never seen before.

What's the problem?

Getting a robot to navigate a new place is hard because it needs to understand where it is, plan a route to a goal, and avoid obstacles, all without having a pre-made map. Existing methods often struggle with long distances, changing environments, or recognizing objects from different viewpoints.

What's the solution?

MG-Nav tackles this with a two-part approach. First, it builds a 'memory' of the environment from key visual features and spatial relationships, organized as a graph whose nodes correspond to regions of the scene. The robot plans a high-level route through this memory graph. Second, a local navigation policy follows that route, steering around obstacles and switching from waypoint-following to aiming directly at the final visual goal once it reaches the last node. A lightweight module called VGGT-adapter helps the robot understand the 3D layout of the scene and match its current view to the goal image.
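The memory-graph idea can be sketched in a few lines. Everything below is an illustrative assumption: the node fields, the breadth-first planner, and the toy graph are placeholders, not the paper's actual structure (MG-Nav plans over its memory graph via image-to-instance hybrid retrieval, not plain BFS).

```python
# Hypothetical sketch of a sparse spatial memory graph: each node
# aggregates observations for one region, and planning returns a
# sequence of region nodes to use as waypoints.
from collections import deque

class SMGNode:
    def __init__(self, node_id):
        self.id = node_id
        self.keyframes = []   # multi-view keyframe features for this region
        self.objects = []     # object-level semantics observed here
        self.neighbors = []   # ids of spatially adjacent region nodes

def plan_node_path(graph, start_id, goal_id):
    """Breadth-first search over region nodes; returns a waypoint sequence."""
    frontier = deque([[start_id]])
    visited = {start_id}
    while frontier:
        path = frontier.popleft()
        if path[-1] == goal_id:
            return path
        for nxt in graph[path[-1]].neighbors:
            if nxt not in visited:
                visited.add(nxt)
                frontier.append(path + [nxt])
    return None  # goal region not reachable in memory

# Tiny example: four regions in a chain, plus one shortcut edge.
graph = {i: SMGNode(i) for i in range(4)}
for a, b in [(0, 1), (1, 2), (2, 3), (0, 2)]:
    graph[a].neighbors.append(b)
    graph[b].neighbors.append(a)

print(plan_node_path(graph, 0, 3))  # -> [0, 2, 3]
```

The waypoint sequence returned here is what the local policy would then execute one node at a time.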

Why it matters?

This research is important because it allows robots to navigate more effectively in unknown and dynamic environments, which is crucial for applications like delivery services, search and rescue, and even helping people with disabilities. It represents a step forward in creating robots that can operate independently in the real world.

Abstract

We present MG-Nav (Memory-Guided Navigation), a dual-scale framework for zero-shot visual navigation that unifies global memory-guided planning with local geometry-enhanced control. At its core is the Sparse Spatial Memory Graph (SMG), a compact, region-centric memory where each node aggregates multi-view keyframe and object semantics, capturing both appearance and spatial structure while preserving viewpoint diversity. At the global level, the agent is localized on SMG and a goal-conditioned node path is planned via an image-to-instance hybrid retrieval, producing a sequence of reachable waypoints for long-horizon guidance. At the local level, a navigation foundation policy executes these waypoints in point-goal mode with obstacle-aware control, and switches to image-goal mode when navigating from the final node towards the visual target. To further enhance viewpoint alignment and goal recognition, we introduce VGGT-adapter, a lightweight geometric module built on the pre-trained VGGT model, which aligns observation and goal features in a shared 3D-aware space. MG-Nav operates global planning and local control at different frequencies, using periodic re-localization to correct errors. Experiments on HM3D Instance-Image-Goal and MP3D Image-Goal benchmarks demonstrate that MG-Nav achieves state-of-the-art zero-shot performance and remains robust under dynamic rearrangements and unseen scene conditions.
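The dual-frequency operation described above (slow global planning with periodic re-localization, fast local control, and a point-goal to image-goal mode switch at the final node) can be illustrated with a minimal loop. The planning-to-control ratio, the waypoint-completion rule, and the stubbed re-localization are assumptions for illustration, not the paper's actual schedule.

```python
# Hypothetical sketch: local control runs every step, while global
# re-localization/re-planning runs once every PLAN_EVERY steps.
PLAN_EVERY = 10  # assumed frequency ratio

def navigate(steps, waypoints):
    mode_log = []
    wp_index = 0
    for t in range(steps):
        if t % PLAN_EVERY == 0:
            # Periodic re-localization against the memory graph would
            # correct accumulated error here (stubbed out in this sketch).
            pass
        at_final_node = wp_index >= len(waypoints) - 1
        # Point-goal mode toward intermediate waypoints; image-goal mode
        # once the agent navigates from the final node to the visual target.
        mode = "image-goal" if at_final_node else "point-goal"
        mode_log.append(mode)
        # Stub: assume one waypoint is reached every PLAN_EVERY steps.
        if (t + 1) % PLAN_EVERY == 0 and not at_final_node:
            wp_index += 1
    return mode_log

log = navigate(steps=30, waypoints=["w0", "w1", "w2"])
print(log.count("point-goal"), log.count("image-goal"))  # -> 20 10
```

The key design point this mirrors is that the expensive global step (localization and retrieval over the memory graph) does not need to run at control rate; occasional corrections suffice.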