Toward Ambulatory Vision: Learning Visually-Grounded Active View Selection
Juil Koo, Daehyeon Choi, Sangwoo Youn, Phillip Y. Lee, Minhyuk Sung
2025-12-16
Summary
This paper focuses on improving how AI systems 'look around' to answer questions about what they see, moving beyond just analyzing single pictures.
What's the problem?
Current AI models that combine vision and language, like those used for visual question answering, are limited because they only process static images. Unlike humans or robots exploring a scene, they can't actively choose better viewpoints to gather more information. Essentially, they lack 'ambulatory vision' – the ability to move and see from different angles to understand things better.
What's the solution?
The researchers define a new task called Visually Grounded Active View Selection (VG-AVS), in which an AI must pick the most informative next viewpoint based *only* on what it currently sees, without relying on memory of past views or outside knowledge. To train for it, they built a synthetic dataset of automatically generated paired query-target views and question-answer prompts, then applied a two-step recipe: first, supervised fine-tuning of pretrained vision-language models, and second, reinforcement learning to further sharpen the AI's viewpoint choices. The result is a learned 'policy' for selecting the best views.
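The two-step recipe can be sketched at toy scale. This is *not* the paper's implementation: the candidate views, the expert label, and the QA reward below are hypothetical stand-ins, and the real method fine-tunes a full vision-language model rather than a four-way logit vector. The sketch only illustrates the shape of the pipeline: imitate expert view labels first (supervised fine-tuning), then optimize a downstream question-answering reward (REINFORCE-style policy gradient).

```python
# Toy sketch of SFT -> RL viewpoint selection (hypothetical stand-in,
# not the paper's code). A "policy" over 4 candidate next viewpoints.
import math
import random

random.seed(0)
N_VIEWS = 4                # hypothetical candidate next viewpoints
logits = [0.0] * N_VIEWS   # stand-in for the VLM's view-scoring head

def softmax(xs):
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [x / s for x in e]

# Stage 1: supervised fine-tuning toward an expert-labeled best view.
expert_label = 2           # hypothetical oracle-annotated "most informative" view
for _ in range(200):
    p = softmax(logits)
    for a in range(N_VIEWS):
        # cross-entropy gradient step: p - onehot(expert_label)
        logits[a] -= 0.5 * (p[a] - (1.0 if a == expert_label else 0.0))

# Stage 2: RL fine-tuning with a downstream QA reward. Here the reward
# stands in for "did the chosen view let the model answer correctly?".
def qa_reward(a):
    return 1.0 if a == 2 else 0.0

for _ in range(200):
    p = softmax(logits)
    a = random.choices(range(N_VIEWS), weights=p)[0]  # sample a viewpoint
    r = qa_reward(a)
    for i in range(N_VIEWS):
        # REINFORCE policy-gradient update on the sampled action
        grad = (1.0 - p[i]) if i == a else -p[i]
        logits[i] += 0.1 * r * grad

best = max(range(N_VIEWS), key=lambda a: softmax(logits)[a])
print(best)  # prints 2: the policy settles on the informative view
```

In the actual framework the "logits" are produced by a fine-tuned VLM conditioned on the current image and question, but the same two-stage structure applies: supervised imitation gives the policy a reasonable starting point, and RL then optimizes directly for answer accuracy.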
Why it matters?
This work matters because it pushes AI systems toward understanding the real world the way embodied agents must: by actively seeking out more informative views. The learned view-selection skill improves question answering about complex scenes, generalizes to unseen synthetic and real environments, and can be integrated into existing scene-exploration-based embodied question answering (EQA) systems to make their answers more accurate.
Abstract
Vision Language Models (VLMs) excel at visual question answering (VQA) but remain limited to snapshot vision, reasoning from static images. In contrast, embodied agents require ambulatory vision, actively moving to obtain more informative views. We introduce Visually Grounded Active View Selection (VG-AVS), a task that selects the most informative next viewpoint using only the visual information in the current image, without relying on scene memory or external knowledge. To support this task, we construct a synthetic dataset with automatically generated paired query-target views and question-answer prompts. We also propose a framework that fine-tunes pretrained VLMs through supervised fine-tuning (SFT) followed by RL-based policy optimization. Our approach achieves strong question answering performance based on viewpoint selection and generalizes robustly to unseen synthetic and real scenes. Furthermore, incorporating our learned VG-AVS framework into existing scene-exploration-based EQA systems improves downstream question-answering accuracy.