StreamVLN: Streaming Vision-and-Language Navigation via SlowFast Context Modeling
Meng Wei, Chenyang Wan, Xiqian Yu, Tai Wang, Yuqiang Yang, Xiaohan Mao, Chenming Zhu, Wenzhe Cai, Hanqing Wang, Yilun Chen, Xihui Liu, Jiangmiao Pang
2025-07-09
Summary
This paper introduces StreamVLN, a system that enables robots and AI agents to navigate by combining vision and language: it processes continuous video streams and natural-language instructions with low latency using a slow-fast context model.
What's the problem?
Previous navigation methods struggle to balance fine-grained visual understanding, long-horizon context tracking, and computational efficiency, which makes them impractical for real-time tasks.
What's the solution?
The researchers created StreamVLN, a hybrid design that pairs a fast-updating dialogue context, a sliding window over recent inputs that reacts quickly to the current observation, with a slow-updating memory context that compresses past visual states for long-term understanding, improving both speed and reasoning over long tasks (a conceptual sketch follows below).
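To make the slow-fast idea concrete, here is a minimal, hypothetical Python sketch of the context bookkeeping. The `SlowFastContext` class, its parameters, and the compression step are illustrative assumptions, not the paper's implementation: StreamVLN itself works on a Video-LLM's token stream with more sophisticated pruning, while this sketch approximates compression with uniform token subsampling.

```python
from collections import deque

import numpy as np


class SlowFastContext:
    """Conceptual sketch of slow-fast context modeling (names hypothetical).

    The fast context keeps the most recent observations at full detail in
    a sliding window; the slow context stores a compressed summary of
    observations that fall out of that window, so total context size grows
    slowly even over long navigation episodes.
    """

    def __init__(self, window_size: int = 8, keep_ratio: float = 0.25):
        self.window_size = window_size   # recent frames kept at full detail
        self.keep_ratio = keep_ratio     # fraction of tokens kept in slow memory
        self.fast_window: deque = deque(maxlen=window_size)
        self.slow_memory: list = []

    def _compress(self, tokens: np.ndarray) -> np.ndarray:
        # Stand-in for the paper's token pruning: keep a uniform subset.
        n_keep = max(1, int(len(tokens) * self.keep_ratio))
        idx = np.linspace(0, len(tokens) - 1, n_keep, dtype=int)
        return tokens[idx]

    def add_observation(self, tokens: np.ndarray) -> None:
        # When the fast window is full, its oldest frame is evicted into
        # the slow memory in compressed form before the new frame enters.
        if len(self.fast_window) == self.fast_window.maxlen:
            evicted = self.fast_window[0]
            self.slow_memory.append(self._compress(evicted))
        self.fast_window.append(tokens)

    def build_context(self) -> np.ndarray:
        # Model input = compressed long-term memory + full recent window.
        parts = self.slow_memory + list(self.fast_window)
        return np.concatenate(parts) if parts else np.empty((0,))


# Usage: stream 20 frames of 64 visual tokens each.
ctx = SlowFastContext(window_size=8, keep_ratio=0.25)
for t in range(20):
    ctx.add_observation(np.full(64, t, dtype=np.float32))
print(ctx.build_context().shape)  # (704,): far fewer than the 1280 raw tokens
```

The design point this illustrates is the trade-off StreamVLN targets: the agent stays responsive because only a small fixed window is processed at full detail, while the compressed memory preserves enough history for long-horizon reasoning.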
Why it matters?
This matters because StreamVLN lets AI agents navigate and follow instructions in real-world environments smoothly and with low delay, making them more reliable and useful in applications such as robotics and assistive technology.
Abstract
StreamVLN is a streaming VLN framework that employs a hybrid slow-fast context modeling strategy: a fast-updating dialogue context supports responsive action generation, while a slow-updating memory context compresses historical visual states. It achieves state-of-the-art performance with low latency and efficient resource usage.