From Perception to Action: An Interactive Benchmark for Vision Reasoning
Yuhao Wu, Maojia Song, Yihuai Lan, Lei Wang, Zhiqiang Hu, Yao Xiao, Heng Zhou, Weihua Zheng, Dylan Raharja, Soujanya Poria, Roy Ka-Wei Lee
2026-02-25
Summary
This research introduces a new way to test how well artificial intelligence systems, specifically Vision-Language Models, understand the physical world and how objects interact with one another.
What's the problem?
Current tests for these AI models mostly ask simple questions about images, like 'What color is this?' This doesn't check whether the AI can actually *reason* about physics – things like how objects support each other, how they connect, or what happens when you try to move them. In short, existing tests don't assess whether an AI can plan a series of actions in a realistic, physical environment.
What's the solution?
The researchers created a virtual environment called CHAIN, which presents AI models with challenges that require understanding physical relationships, such as solving interlocking mechanical puzzles and stacking blocks. They then tested several advanced AI models in this environment, having them interact with it and attempt to complete these tasks.
Why it matters?
This is important because if we want robots or AI assistants to be truly helpful in the real world – like assembling furniture or packing a box – they need to understand how physics works. The results show that even the best AI models still struggle with this, meaning there's a lot of work to be done before we have AI that can reliably interact with the physical world.
Abstract
Understanding physical structure is essential for real-world applications such as embodied agents, interactive design, and long-horizon manipulation. Yet prevailing Vision-Language Model (VLM) evaluations still center on structure-agnostic, single-turn setups (e.g., VQA), which fail to assess agents' ability to reason about how geometry, contact, and support relations jointly constrain what actions are possible in a dynamic environment. To address this gap, we introduce the Causal Hierarchy of Actions and Interactions (CHAIN) benchmark, an interactive, physics-driven 3D testbed designed to evaluate whether models can understand, plan, and execute structured action sequences grounded in physical constraints. CHAIN shifts evaluation from passive perception to active problem solving, spanning tasks such as interlocking mechanical puzzles and 3D stacking and packing. We conduct a comprehensive study of state-of-the-art VLMs and diffusion-based models under unified interactive settings. Our results show that top-performing models still struggle to internalize physical structure and causal constraints, often failing to produce reliable long-horizon plans or to robustly translate perceived structure into effective actions. The project is available at https://social-ai-studio.github.io/CHAIN/.