VL-LN Bench: Towards Long-horizon Goal-oriented Navigation with Active Dialogs

Wensi Huang, Shaohao Zhu, Meng Wei, Jinming Xu, Xihui Liu, Hanqing Wang, Tai Wang, Feng Zhao, Jiangmiao Pang

2025-12-30

Summary

This paper introduces a new challenge for robots learning to navigate: understanding and responding to unclear instructions through conversation. It's about making robots better at figuring out what people *mean* when they give directions, not just following exact commands.

What's the problem?

Current robot navigation research focuses on robots following very specific instructions, like 'go to the kitchen' or 'find the red cup'. However, in the real world, people often give vague or incomplete directions, like 'over there' or 'near the couch'. Robots need to be able to ask clarifying questions to understand what a person wants, something existing research doesn't really address.

What's the solution?

The researchers created a new task called Interactive Instance Object Navigation (IION) where a robot has to navigate while also being able to ask questions to a person (called an 'oracle') to get more information. They also built a large dataset, called VL-LN, with over 41,000 examples of these conversations and navigation paths, allowing them to train and test robots on this more realistic scenario. They then built a robot model that can both navigate and have a conversation, and showed it performs better than existing methods.
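To make the interaction pattern concrete, here is a minimal toy sketch of an IION-style loop: the agent either asks the oracle a clarifying question or takes a navigation action once it has resolved the ambiguity. All class and attribute names here are illustrative assumptions, not the paper's actual API or model.

```python
# Toy sketch of the IION interaction loop (illustrative only):
# the agent alternates between dialog turns and navigation actions.

class Oracle:
    """Answers the agent's clarifying questions about an ambiguous goal."""
    def __init__(self, facts):
        self.facts = facts  # hidden target attributes, e.g. {"color": "red"}

    def answer(self, question):
        # Return the fact whose attribute name appears in the question.
        for key, value in self.facts.items():
            if key in question.lower():
                return f"The target's {key} is {value}."
        return "I can't help with that."

class Agent:
    """Toy agent: queries unknown attributes first, then 'navigates'."""
    def __init__(self, attributes_needed):
        self.unknown = list(attributes_needed)
        self.known = {}

    def step(self, oracle):
        if self.unknown:  # ambiguity remains: spend this step on dialog
            attr = self.unknown.pop(0)
            reply = oracle.answer(f"What is the {attr} of the target?")
            self.known[attr] = reply
            return ("dialog", reply)
        # All attributes resolved: emit a navigation action instead
        return ("navigate", f"moving toward target given {self.known}")

oracle = Oracle({"color": "red", "room": "kitchen"})
agent = Agent(["color", "room"])
transcript = [agent.step(oracle) for _ in range(3)]
# First two steps are dialog turns; the third is a navigation action.
```

In the actual benchmark, both the questions and the navigation actions are produced by a trained model conditioned on vision and language inputs, and the oracle's answers come from an automatic evaluation protocol rather than a lookup table; this sketch only shows the turn-taking structure.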

Why it matters?

This work is important because it moves robot navigation closer to real-world usefulness. If robots can understand and ask for clarification when instructions are unclear, they'll be much more helpful and reliable in everyday situations. The new dataset provides a valuable resource for other researchers to build upon and improve dialog-enabled navigation systems.

Abstract

In most existing embodied navigation tasks, such as instruction following and object searching, instructions are well-defined and unambiguous. Under this idealized setting, agents are required solely to produce effective navigation outputs conditioned on vision and language inputs. However, real-world navigation instructions are often vague and ambiguous, requiring the agent to resolve uncertainty and infer user intent through active dialog. To address this gap, we propose Interactive Instance Object Navigation (IION), a task that requires agents not only to generate navigation actions but also to produce language outputs via active dialog, thereby aligning more closely with practical settings. IION extends Instance Object Navigation (ION) by allowing agents to freely consult an oracle in natural language while navigating. Building on this task, we present the Vision Language-Language Navigation (VL-LN) benchmark, which provides a large-scale, automatically generated dataset and a comprehensive evaluation protocol for training and assessing dialog-enabled navigation models. VL-LN comprises over 41k long-horizon dialog-augmented trajectories for training and an automatic evaluation protocol with an oracle capable of responding to agent queries. Using this benchmark, we train a navigation model equipped with dialog capabilities and show that it achieves significant improvements over the baselines. Extensive experiments and analyses further demonstrate the effectiveness and reliability of VL-LN for advancing research on dialog-enabled embodied navigation. Code and dataset: https://0309hws.github.io/VL-LN.github.io/