CoV: Chain-of-View Prompting for Spatial Reasoning

Haoyu Zhao, Akide Liu, Zeyu Zhang, Weijie Wang, Feng Chen, Ruihan Zhu, Gholamreza Haffari, Bohan Zhuang

2026-01-09

Summary

This paper introduces a new method called Chain-of-View, or CoV, to help computer programs answer questions about 3D environments. It focuses on improving how these programs 'look around' to find the information they need.

What's the problem?

Current computer programs that answer questions about 3D scenes struggle because they can only see a limited number of viewpoints at a time. Imagine trying to find something in a room if you could only look at a few pictures – you might miss important clues hidden from those angles. Also, objects can be partially hidden, making it even harder to gather all the necessary information. These programs aren't very good at actively exploring the scene to find what's relevant to the question.

What's the solution?

CoV works by letting the program actively choose where to look. It first quickly scans the scene to find promising viewpoints, then it refines its view by taking small 'steps' to get a better look. It keeps doing this, reasoning about the question as it goes, until it has enough information to answer or it runs out of 'steps'. Importantly, this method doesn't require any extra training of the program; it works with existing models by simply giving them a better way to explore the environment during question answering.
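The coarse-to-fine loop described above can be sketched in code. This is a minimal, runnable toy illustration of the idea, not the authors' implementation: the VLM, the scene renderer, the action set, and the anchor-selection heuristic are all stand-in assumptions chosen to make the control flow concrete.

```python
# Toy sketch of the Chain-of-View (CoV) loop. All components here
# (toy_vlm, toy_scene, the tag-overlap heuristic) are illustrative
# assumptions, not the paper's actual models or API.

CAMERA_ACTIONS = ["rotate_left", "rotate_right", "zoom_in", "zoom_out"]

def select_anchor_views(question, frames):
    """Coarse stage: drop redundant frames, keep question-aligned anchors.
    Toy heuristic: keep frames whose tags share a word with the question."""
    words = set(question.lower().split())
    kept = [f for f in frames if words & set(f["tags"])]
    return kept or frames[:1]

def chain_of_view(vlm_reason, scene_render, question, frames, max_steps=4):
    """Fine stage: interleave reasoning with discrete camera actions
    until the question is answerable or the step budget runs out."""
    views = select_anchor_views(question, frames)
    for _ in range(max_steps):
        decision = vlm_reason(question, views)          # reasoning step
        if decision["answerable"]:
            return decision["answer"]
        # Take a camera action and render a new observation from the scene.
        views.append(scene_render(decision["action"]))
    # Budget exhausted: answer with whatever context was gathered.
    return vlm_reason(question, views, force=True)["answer"]

def toy_vlm(question, views, force=False):
    # Answers once a view showing the chair is visible; otherwise zooms in.
    if force or any("chair" in v["tags"] for v in views):
        return {"answerable": True, "answer": "a red chair"}
    return {"answerable": False, "action": "zoom_in"}

def toy_scene(action):
    # Pretend the requested camera action reveals the occluded chair.
    return {"tags": ["chair"]}

frames = [{"tags": ["table"]}, {"tags": ["wall"]}]
answer = chain_of_view(toy_vlm, toy_scene, "what color is the chair", frames)
print(answer)
```

The key design point the sketch captures is that no model weights change: the loop only supplies new observations at inference time, which is why the method plugs into existing VLMs.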

Why it matters?

This research is important because it significantly improves the ability of programs to understand and reason about 3D spaces. By allowing them to actively search for information, CoV makes them more accurate and reliable when answering questions about these environments. The fact that it works with existing models without needing further training makes it a practical and widely applicable solution for improving spatial reasoning in virtual worlds and potentially even robotics.

Abstract

Embodied question answering (EQA) in 3D environments often requires collecting context that is distributed across multiple viewpoints and partially occluded. However, most recent vision-language models (VLMs) are constrained to a fixed and finite set of input views, which limits their ability to acquire question-relevant context at inference time and hinders complex spatial reasoning. We propose Chain-of-View (CoV) prompting, a training-free, test-time reasoning framework that transforms a VLM into an active viewpoint reasoner through a coarse-to-fine exploration process. CoV first employs a View Selection agent to filter redundant frames and identify question-aligned anchor views. It then performs fine-grained view adjustment by interleaving iterative reasoning with discrete camera actions, obtaining new observations from the underlying 3D scene representation until sufficient context is gathered or a step budget is reached. We evaluate CoV on OpenEQA across four mainstream VLMs and obtain an average +11.56% improvement in LLM-Match, with a maximum gain of +13.62% on Qwen3-VL-Flash. CoV further exhibits test-time scaling: increasing the minimum action budget yields an additional +2.51% average improvement, peaking at +3.73% on Gemini-2.5-Flash. On ScanQA and SQA3D, CoV delivers strong performance (e.g., 116 CIDEr / 31.9 EM@1 on ScanQA and 51.1 EM@1 on SQA3D). Overall, these results suggest that question-aligned view selection coupled with open-view search is an effective, model-agnostic strategy for improving spatial reasoning in 3D EQA without additional training.