MM-CondChain: A Programmatically Verified Benchmark for Visually Grounded Deep Compositional Reasoning
Haozhan Shen, Shilin Yan, Hongwei Xue, Shuaiqi Lu, Xiaojun Tang, Guannan Zhang, Tiancheng Zhao, Jianwei Yin
2026-03-16
Summary
This paper introduces a new way to test how well AI models that can 'see' images and understand language can follow complex, step-by-step instructions based on what they see. These models, called Multimodal Large Language Models, are being used for tasks like automating actions on a computer screen, but current tests aren't challenging enough.
What's the problem?
Existing tests for these AI models don't really push them to think deeply about images and follow a long chain of 'if-then' conditions. For example, a good model should be able to handle instructions like 'If you see a permission box *and* the screen is green, click Allow'. Current benchmarks either test simple conditions or treat each condition as a separate, unconnected step. This means we don't know how well these models can handle real-world tasks that require careful, multi-step reasoning based on visual information.
What's the solution?
The researchers created a new benchmark called MM-CondChain. This benchmark presents the AI with a series of visual challenges organized like a workflow, where each step depends on correctly identifying things in the image and following a specific rule. To build this benchmark, they developed a system that automatically creates these complex instructions, making sure each step can be verified as correct. They tested this benchmark on different types of images – regular photos, charts, and screenshots of computer interfaces.
Why it matters?
This work is important because it shows that even the most advanced AI models still struggle with complex visual reasoning. They aren't very good at following long chains of instructions that require them to carefully analyze images and make decisions based on multiple factors. Improving this ability is crucial for building AI that can reliably automate tasks and interact with the world around us.
Abstract
Multimodal Large Language Models (MLLMs) are increasingly used to carry out visual workflows such as navigating GUIs, where the next step depends on verified visual compositional conditions (e.g., "if a permission dialog appears and the color of the interface is green, click Allow") and the process may branch or terminate early. Yet this capability remains under-evaluated: existing benchmarks focus on shallow compositions or independent constraints rather than deeply chained compositional conditionals. In this paper, we introduce MM-CondChain, a benchmark for visually grounded deep compositional reasoning. Each benchmark instance is organized as a multi-layer reasoning chain, where every layer contains a non-trivial compositional condition grounded in visual evidence and built from multiple objects, attributes, or relations. To answer correctly, an MLLM must perceive the image in detail, reason over multiple visual elements at each step, and follow the resulting execution path to the final outcome. To scalably construct such workflow-style data, we propose an agentic synthesis pipeline: a Planner orchestrates layer-by-layer generation of compositional conditions, while a Verifiable Programmatic Intermediate Representation (VPIR) ensures each layer's condition is mechanically verifiable. A Composer then assembles these verified layers into complete instructions. Using this pipeline, we construct benchmarks across three visual domains: natural images, data charts, and GUI trajectories. Experiments on a range of MLLMs show that even the strongest model attains only 53.33 Path F1, with sharp drops on hard negatives and as depth or predicate complexity grows, confirming that deep compositional reasoning remains a fundamental challenge.
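To make the idea of a mechanically verifiable, layered condition chain concrete, here is a minimal illustrative sketch. It is not the paper's actual VPIR format: the annotation schema, the `Layer` class, and all field names are hypothetical assumptions. It only shows the general shape of a chain in which each layer's compositional condition is checked programmatically against ground-truth visual annotations, and the outcome selects the next step or terminates the workflow.

```python
# Hypothetical sketch of a verifiable condition chain, loosely in the spirit
# of the abstract's VPIR idea. The schema and names are illustrative only.
from dataclasses import dataclass
from typing import Callable

# Assumed ground-truth annotations for one GUI screenshot (not the paper's format).
annotations = {
    "objects": {"permission_dialog", "allow_button"},
    "attributes": {"interface_color": "green"},
}

@dataclass
class Layer:
    # A compositional condition: every predicate must hold on the annotations.
    predicates: list[Callable[[dict], bool]]
    if_true: str   # outcome (action or next-layer label) when the condition holds
    if_false: str  # outcome otherwise (may terminate the chain early)

    def evaluate(self, ann: dict) -> str:
        # Mechanical verification: run each predicate on the annotations.
        return self.if_true if all(p(ann) for p in self.predicates) else self.if_false

# One layer mirroring the abstract's example condition:
# "if a permission dialog appears AND the interface is green, click Allow".
layer1 = Layer(
    predicates=[
        lambda a: "permission_dialog" in a["objects"],
        lambda a: a["attributes"].get("interface_color") == "green",
    ],
    if_true="click Allow",
    if_false="terminate",
)

print(layer1.evaluate(annotations))  # -> click Allow
```

A full benchmark instance would chain several such layers, so the model's predicted execution path can be compared layer by layer against the programmatically derived ground-truth path (as in the Path F1 metric).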