V-REX: Benchmarking Exploratory Visual Reasoning via Chain-of-Questions

Chenrui Fan, Yijun Liang, Shweta Bhardwaj, Kwesi Cobbina, Ming Li, Tianyi Zhou

2025-12-16

Summary

This paper focuses on the difficulty current vision-language models (VLMs) have when dealing with complex tasks that require them to actively explore and think through visual information, rather than just answering simple, direct questions.

What's the problem?

Existing VLMs are really good at answering specific questions about images, like 'what color is the car?' But they struggle with tasks that need more back-and-forth investigation, like 'find the hidden object in this room.' These complex tasks require the AI to ask itself questions, look at different parts of the image, and reason step-by-step, which is hard to test and improve because there are so many possible ways to approach them.

What's the solution?

The researchers created a new evaluation suite called V-REX, which presents VLMs with challenging visual reasoning problems. V-REX breaks each problem down into a series of questions the model should ask itself to reach the answer. This 'Chain-of-Questions' approach lets them evaluate two key skills separately: 'Planning' – figuring out *which* questions to ask, and 'Following' – actually answering a curated sequence of questions to gather the information needed for the final answer. Because each step offers only a finite set of candidate questions and answers, they can reliably measure how well the model does at every intermediate step, not just whether the final answer is right.
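To make the two scores concrete, here is a minimal, hypothetical sketch of how Planning and Following accuracy could be computed for a toy Chain-of-Questions item. The class and function names, and the data layout, are illustrative assumptions, not the benchmark's actual code; the key idea from the paper is simply that each step has a finite set of candidate questions and answers, so every intermediate choice can be graded exactly.

```python
# Hypothetical sketch of Chain-of-Questions (CoQ) scoring in the V-REX style.
# All names and the data layout are illustrative assumptions.
from dataclasses import dataclass

@dataclass
class CoQStep:
    candidate_questions: list[str]  # finite question options at this step
    gold_question: str              # the question an ideal explorer would pick
    candidate_answers: list[str]    # finite answer options for the gold question
    gold_answer: str                # correct answer to the gold question

def planning_score(steps: list[CoQStep], chosen_questions: list[str]) -> float:
    """Fraction of steps where the model picked the gold exploratory question."""
    hits = sum(q == s.gold_question for s, q in zip(steps, chosen_questions))
    return hits / len(steps)

def following_score(steps: list[CoQStep], given_answers: list[str]) -> float:
    """Fraction of curated questions the model answered correctly."""
    hits = sum(a == s.gold_answer for s, a in zip(steps, given_answers))
    return hits / len(steps)

# Toy item with two exploration steps.
steps = [
    CoQStep(["Is the door open?", "What color is the rug?"],
            "Is the door open?", ["yes", "no"], "no"),
    CoQStep(["Is anything behind the sofa?", "How many windows are there?"],
            "Is anything behind the sofa?", ["yes", "no"], "yes"),
]

print(planning_score(steps, ["Is the door open?", "How many windows are there?"]))  # 0.5
print(following_score(steps, ["no", "yes"]))                                        # 1.0
```

Separating the two scores this way mirrors the paper's point: a model can be good at answering curated questions (Following) while still failing to choose the right questions on its own (Planning), or vice versa.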

Why it matters?

This work is important because it highlights a major weakness in current AI systems – their inability to think and explore visually like humans do. By providing a way to specifically test and measure these skills, the researchers can help guide the development of more intelligent and capable VLMs that can handle real-world, complex tasks.

Abstract

While many vision-language models (VLMs) are developed to answer well-defined, straightforward questions with highly specified targets, as in most benchmarks, they often struggle in practice with complex open-ended tasks, which usually require multiple rounds of exploration and reasoning in the visual space. Such visual thinking paths not only provide step-by-step exploration and verification as an AI detective but also produce better interpretations of the final answers. However, these paths are challenging to evaluate due to the large exploration space of intermediate steps. To bridge the gap, we develop an evaluation suite, "Visual Reasoning with multi-step EXploration (V-REX)", which is composed of a benchmark of challenging visual reasoning tasks requiring native multi-step exploration and an evaluation protocol. V-REX covers rich application scenarios across diverse domains. V-REX casts the multi-step exploratory reasoning into a Chain-of-Questions (CoQ) and disentangles VLMs' capability into (1) Planning: breaking down an open-ended task by selecting a chain of exploratory questions; and (2) Following: answering a curated CoQ sequentially to collect information for deriving the final answer. By curating finite options of questions and answers per step, V-REX achieves a reliable quantitative and fine-grained analysis of the intermediate steps. By assessing SOTA proprietary and open-source VLMs, we reveal consistent scaling trends, significant differences between planning and following abilities, and substantial room for improvement in multi-step exploratory reasoning.