PhysVLM-AVR: Active Visual Reasoning for Multimodal Large Language Models in Physical Environments
Weijie Zhou, Xuantang Xiong, Yi Peng, Manli Tao, Chaoyang Zhao, Honghui Dong, Ming Tang, Jinqiao Wang
2025-10-27
Summary
This paper explores how well large language models that can understand both text and images, called multimodal large language models or MLLMs, can solve problems that require them to actively gather information. It's about making these models more like humans, who don't just passively look at things but move around and interact to figure things out.
What's the problem?
Current MLLMs are good at reasoning about images, but only when they can see everything at once. In the real world, things are often hidden, or you need to change your viewpoint to understand a situation. These models struggle when information is incomplete because they don't actively try to gather more of it; they can't just 'see' the answer immediately.
What's the solution?
The researchers created a new challenge called Active Visual Reasoning (AVR) where models have to take actions – like moving a camera – to gather information and then use that information to solve a problem. They also built a simulated environment called CLEVR-AVR and a large dataset with detailed explanations of how to think through these problems step-by-step. Finally, they developed a new model, PhysVLM-AVR, that performs well on this task and other related challenges.
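The gather-then-reason loop described above can be sketched in a few lines of Python. This is purely an illustrative toy, not the paper's actual benchmark or API: the scene, the uncertainty check, and the action selector are all placeholder stand-ins for the learned components (uncertainty identification, information-gain prediction, and action selection) the paper trains.

```python
# Toy sketch of the AVR closed loop: observe, detect uncertainty, pick an
# information-gathering action, and repeat until the agent can answer.
# All names and the environment are illustrative, not from the paper.

class ToyScene:
    """A scene where one object stays occluded until the camera moves."""
    def __init__(self):
        self.revealed = False

    def observe(self):
        objs = ["red cube"] + (["blue sphere"] if self.revealed else [])
        return {"visible_objects": objs}

    def step(self, action):
        if action == "move_camera":
            self.revealed = True  # moving the camera reveals the occluded object
        return self.observe()

def uncertain(obs):
    # Placeholder uncertainty check: keep exploring until two objects are seen.
    return len(obs["visible_objects"]) < 2

def select_action(obs):
    # Placeholder for action-conditioned information-gain prediction:
    # in this toy, moving the camera is always the most informative action.
    return "move_camera"

def answer(obs, question):
    return "yes" if "blue sphere" in obs["visible_objects"] else "no"

def avr_loop(env, question, max_steps=5):
    obs = env.observe()
    for _ in range(max_steps):
        if not uncertain(obs):
            break  # enough information gathered; stop acting
        obs = env.step(select_action(obs))
    return answer(obs, question)

print(avr_loop(ToyScene(), "Is there a blue sphere?"))  # prints "yes"
```

The point of the sketch is the control flow: unlike a passive model that answers from a single image, the agent interleaves perception, reasoning about what it still doesn't know, and physical action until the question becomes answerable.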
Why it matters?
This work is important because it identifies a key weakness in current MLLMs: their inability to actively seek out information. Improving this 'active reasoning' ability is crucial for building AI systems that can truly understand and interact with the real world, going beyond just analyzing static images or text. It pushes the field towards more capable and adaptable AI agents.
Abstract
Visual reasoning in multimodal large language models (MLLMs) has primarily been studied in static, fully observable settings, limiting their effectiveness in real-world environments where information is often incomplete due to occlusion or limited field of view. Humans, in contrast, actively explore and interact with their environment (moving, examining, and manipulating objects) to gather information through a closed-loop process integrating perception, reasoning, and action. Inspired by this human capability, we introduce the Active Visual Reasoning (AVR) task, extending visual reasoning to partially observable, interactive environments. AVR requires agents to: (1) actively acquire information via sequential physical actions, (2) integrate observations across multiple steps for coherent reasoning, and (3) dynamically adjust decisions based on evolving visual feedback. To rigorously evaluate AVR, we introduce CLEVR-AVR, a simulation benchmark featuring multi-round interactive environments designed to assess both reasoning correctness and information-gathering efficiency. We present AVR-152k, a large-scale dataset that offers rich Chain-of-Thought (CoT) annotations detailing iterative reasoning for uncertainty identification, action-conditioned information gain prediction, and information-maximizing action selection, crucial for training agents in a higher-order Markov Decision Process. Building on this, we develop PhysVLM-AVR, an MLLM achieving state-of-the-art performance on CLEVR-AVR, embodied reasoning (OpenEQA, RoboVQA), and passive visual reasoning (GeoMath, Geometry30K). Our analysis also reveals that current embodied MLLMs, despite detecting information incompleteness, struggle to actively acquire and integrate new information through interaction, highlighting a fundamental gap in active reasoning capabilities.