Beyond Description: Cognitively Benchmarking Fine-Grained Action for Embodied Agents

Dayong Liu, Chao Xu, Weihong Chen, Suyu Zhang, Juncheng Wang, Jiankang Deng, Baigui Sun, Yang Liu

2025-12-03

Summary

This paper introduces a new way to test how well artificial intelligence, specifically Multimodal Large Language Models (MLLMs), can understand and interact with the physical world like a robot would.

What's the problem?

Current tests for these AI models focus on big-picture planning or understanding where things are in space, but they don't really check whether the AI can figure out *how* to do things, such as actually manipulating objects or working out the steps needed for a physical task. In short, existing benchmarks don't adequately assess the detailed 'action intelligence' a robot needs to function effectively in a real environment.

What's the solution?

The researchers created a new benchmark called CFG-Bench, a large collection of videos and questions designed specifically to test an AI's understanding of physical interactions, cause and effect, what people are trying to achieve, and how to judge whether an action succeeded. They then evaluated several leading MLLMs on this benchmark and also showed that fine-tuning an MLLM on CFG-Bench data improved its performance on other, established robot-related tests.

Why it matters?

This work is important because it highlights the weaknesses of current AI models when it comes to real-world physical tasks. By identifying these limitations and providing a better way to test them, the researchers are helping to guide the development of more capable and practical AI agents that can actually interact with and operate in our physical world.

Abstract

Multimodal Large Language Models (MLLMs) show promising results as decision-making engines for embodied agents operating in complex, physical environments. However, existing benchmarks often prioritize high-level planning or spatial reasoning, leaving the fine-grained action intelligence required for embodied physical interaction underexplored. To address this gap, we introduce CFG-Bench, a new benchmark designed to systematically evaluate this crucial capability. CFG-Bench consists of 1,368 curated videos paired with 19,562 three-modalities question-answer pairs targeting four cognitive abilities: 1) Physical Interaction, 2) Temporal-Causal Relation, 3) Intentional Understanding, and 4) Evaluative Judgment. Together, these dimensions provide a systematic framework for assessing a model's ability to translate visual observations into actionable knowledge, moving beyond mere surface-level recognition. Our comprehensive evaluation on CFG-Bench reveals that leading MLLMs struggle to produce detailed instructions for physical interactions and exhibit profound limitations in the higher-order reasoning of intention and evaluation. Moreover, supervised fine-tuning (SFT) on our data demonstrates that teaching an MLLM to articulate fine-grained actions directly translates to significant performance gains on established embodied benchmarks. Our analysis highlights these limitations and offers insights for developing more capable and grounded embodied agents.