
SAME: Learning Generic Language-Guided Visual Navigation with State-Adaptive Mixture of Experts

Gengze Zhou, Yicong Hong, Zun Wang, Chongyang Zhao, Mohit Bansal, Qi Wu

2024-12-13


Summary

This paper introduces SAME, a model designed to improve how agents navigate environments from natural-language instructions by combining guidance at different levels of detail.

What's the problem?

Language-guided navigation spans two main kinds of tasks: high-level searches that emphasize exploration, and low-level tasks that require following detailed commands. Previous methods often struggled to combine these two approaches, making it hard for a single agent to interpret and act on both kinds of instructions.

What's the solution?

SAME introduces a unified framework that lets a single agent learn from many types of navigation tasks. It uses a State-Adaptive Mixture of Experts model, meaning it adapts its decision-making to the granularity of the language instruction and the current situation. One agent trained this way handles seven navigation tasks at once, matching or outperforming models specialized for a single task.
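To make the mixture-of-experts idea concrete, here is a minimal sketch of the general pattern: a router reads the agent's current state (a fused instruction-plus-observation feature vector) and softly mixes the outputs of several expert networks. This is a generic illustration, not the paper's actual architecture; the class name, dimensions, and linear experts are all simplifying assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

class StateAdaptiveMoE:
    """Toy state-adaptive mixture-of-experts layer (illustrative only).

    Each expert is a simple linear map; a router produces mixing weights
    that depend on the current state, so the same set of experts can be
    reused across tasks while the mixture adapts per step.
    """

    def __init__(self, dim, num_experts):
        # Hypothetical parameters: one weight matrix per expert plus a router.
        self.experts = [rng.normal(0, 0.02, (dim, dim)) for _ in range(num_experts)]
        self.router = rng.normal(0, 0.02, (dim, num_experts))

    def __call__(self, state):
        # Routing weights are computed from the state itself, so the
        # expert mixture changes as the instruction granularity or the
        # visual observation changes.
        logits = state @ self.router
        weights = np.exp(logits - logits.max())  # numerically stable softmax
        weights /= weights.sum()
        # Softly combine expert outputs instead of picking one task-specific head.
        return sum(w * (state @ W) for w, W in zip(weights, self.experts))

# Toy usage: two different "states" route through the same experts
# with different mixing weights.
dim, num_experts = 16, 4
layer = StateAdaptiveMoE(dim, num_experts)
out = layer(rng.normal(size=dim))
print(out.shape)  # (16,)
```

The key design point this sketch illustrates is that task specialization lives in the routing, not in separate models: shared experts capture general navigation knowledge, while the state-dependent weights exploit task-specific capabilities.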

Why it matters?

This research matters because it improves the ability of AI agents to understand and follow instructions of varying complexity in realistic environments. By unifying how agents navigate from language, SAME can lead to better applications in robotics, virtual reality, and other fields where language-guided navigation is crucial.

Abstract

The academic field of learning instruction-guided visual navigation can be generally categorized into high-level category-specific search and low-level language-guided navigation, depending on the granularity of language instruction, in which the former emphasizes the exploration process, while the latter concentrates on following detailed textual commands. Despite the differing focuses of these tasks, the underlying requirements of interpreting instructions, comprehending the surroundings, and inferring action decisions remain consistent. This paper consolidates diverse navigation tasks into a unified and generic framework -- we investigate the core difficulties of sharing general knowledge and exploiting task-specific capabilities in learning navigation and propose a novel State-Adaptive Mixture of Experts (SAME) model that effectively enables an agent to infer decisions based on different-granularity language and dynamic observations. Powered by SAME, we present a versatile agent capable of addressing seven navigation tasks simultaneously that outperforms or achieves highly comparable performance to task-specific agents.