Explain Before You Answer: A Survey on Compositional Visual Reasoning

Fucai Ke, Joy Hsu, Zhixi Cai, Zixian Ma, Xin Zheng, Xindi Wu, Sukai Huang, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, Ranjay Krishna, Jiajun Wu, Hamid Rezatofighi

2025-08-26

Summary

This paper is a comprehensive overview of the growing field of compositional visual reasoning, which aims to give computers a human-like ability to understand images: breaking a scene down into parts, understanding what those parts *mean*, and then using step-by-step logic to answer questions or solve problems. It covers research published from 2023 to 2025.

What's the problem?

Until this survey, there was no single, organized resource summarizing recent advances in compositional visual reasoning. Research was scattered across different conferences and approaches, making it hard to see the big picture or track how the field was evolving. Existing surveys covered broader areas, such as vision-language models in general or multimodal reasoning at large, rather than this specific, detailed type of visual understanding.

What's the solution?

The authors reviewed over 260 research papers and organized them into a clear framework. They identified a five-stage progression of methods: starting with simple prompting added on top of existing models, then moving to systems that call external 'tools' (first built around language models, then around vision-language models), and finally to more advanced approaches that reason step-by-step and act like intelligent agents; a simplified sketch of the tool-using idea appears below. They also cataloged more than 60 benchmarks and the different ways researchers are *testing* these systems, looking at things like grounding accuracy and how faithfully the systems can explain their reasoning. Finally, they pinpointed the key challenges and areas for future work.
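To make the 'tool-using' stage concrete, here is a minimal, hypothetical sketch of such a pipeline: a planner breaks a question into calls to separate vision modules and composes the results. The function names (`detect_objects`, `classify_attribute`) and the hard-coded plan are illustrative stand-ins, not components from any system in the survey; real pipelines typically use an LLM to write the plan and trained vision models as the tools.

```python
# Hypothetical sketch of a tool-using compositional pipeline.
# A planner turns a question into a sequence of vision-tool calls;
# real systems use an LLM as the planner and trained models as tools.

from dataclasses import dataclass


@dataclass
class DetectedObject:
    label: str
    box: tuple[int, int, int, int]  # (x, y, width, height)


def detect_objects(image, label: str) -> list[DetectedObject]:
    """Stand-in for an object detector; returns a dummy result for illustration."""
    return [DetectedObject(label, (10, 20, 50, 80))]


def classify_attribute(image, obj: DetectedObject, attribute: str) -> str:
    """Stand-in for an attribute classifier over a cropped region."""
    return "red"  # dummy result


def answer(image, question: str) -> str:
    # A real pipeline would prompt an LLM to produce this plan;
    # here the decomposition is hard-coded for one example question.
    if question == "What color is the ball?":
        balls = detect_objects(image, "ball")  # step 1: ground the noun phrase
        if not balls:
            return "I can't find a ball."
        return classify_attribute(image, balls[0], "color")  # step 2: query the attribute
    raise NotImplementedError("no general planner in this sketch")


print(answer(image=None, question="What color is the ball?"))  # -> red
```

The appeal of this kind of decomposition, as the survey argues, is interpretability: each intermediate step (the detected ball, the predicted color) can be inspected on its own, unlike a single end-to-end prediction.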

Why it matters?

This survey is important because it provides a foundational resource for researchers in this field. It helps everyone understand where the field has been, where it is now, and where it needs to go. By clearly outlining the challenges and potential future directions, it aims to inspire new research and accelerate progress towards building AI systems that can truly 'see' and understand the world around them.

Abstract

Compositional visual reasoning has emerged as a key research frontier in multimodal AI, aiming to endow machines with the human-like ability to decompose visual scenes, ground intermediate concepts, and perform multi-step logical inference. While early surveys focus on monolithic vision-language models or general multimodal reasoning, a dedicated synthesis of the rapidly expanding compositional visual reasoning literature is still missing. We fill this gap with a comprehensive survey spanning 2023 to 2025 that systematically reviews 260+ papers from top venues (CVPR, ICCV, NeurIPS, ICML, ACL, etc.). We first formalize core definitions and describe why compositional approaches offer advantages in cognitive alignment, semantic fidelity, robustness, interpretability, and data efficiency. Next, we trace a five-stage paradigm shift: from prompt-enhanced language-centric pipelines, through tool-enhanced LLMs and tool-enhanced VLMs, to recently minted chain-of-thought reasoning and unified agentic VLMs, highlighting their architectural designs, strengths, and limitations. We then catalog 60+ benchmarks and corresponding metrics that probe compositional visual reasoning along dimensions such as grounding accuracy, chain-of-thought faithfulness, and high-resolution perception. Drawing on these analyses, we distill key insights, identify open challenges (e.g., limitations of LLM-based reasoning, hallucination, a bias toward deductive reasoning, scalable supervision, tool integration, and benchmark limitations), and outline future directions, including world-model integration, human-AI collaborative reasoning, and richer evaluation protocols. By offering a unified taxonomy, historical roadmap, and critical outlook, this survey aims to serve as a foundational reference and inspire the next generation of compositional visual reasoning research.