OmniEAR: Benchmarking Agent Reasoning in Embodied Tasks
Zixuan Wang, Dingming Li, Hongxing Li, Shuo Chen, Yuchen Yan, Wenqi Zhang, Yongliang Shen, Weiming Lu, Jun Xiao, Yueting Zhuang
2025-08-12
Summary
This paper introduces OmniEAR, a new benchmark designed to measure how well language models reason about tasks grounded in the physical world. It tests whether these AI systems can work through scenarios that involve interacting with objects, using tools, and collaborating with other agents in both household and industrial environments.
What's the problem?
While large language models excel at abstract reasoning, they struggle with embodied reasoning: understanding and acting in the physical world through interaction and collaboration. Current models have trouble with complex spatial relationships, tool use, and coordinating with other agents, especially when information is incomplete or when they must recognize on their own that cooperation is needed rather than being told explicitly.
What's the solution?
To address this, the researchers built OmniEAR around 1,500 carefully designed scenarios populated with objects and tasks that probe distinct skills: following direct commands, using tools, reasoning about physical attributes, and multi-agent collaboration. The benchmark evaluates language models across these situations, measuring performance both as single agents and in teams. It also applies programmatic checks to automatically score task completion, with human validation to ensure quality.
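To make the idea of automatic, rule-based scoring concrete, here is a minimal sketch of how a single scenario with a physical constraint might be checked. All names and the schema here are illustrative assumptions, not OmniEAR's actual format: the benchmark's real scenario representation and evaluation logic are more elaborate.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """Hypothetical embodied-task scenario (illustrative schema, not OmniEAR's)."""
    task: str
    objects: dict[str, dict]   # object name -> physical attributes
    lift_limit: float = 10.0   # max weight (kg) a single agent can carry

def plan_succeeds(scenario: Scenario, plan: list[tuple[str, str]]) -> bool:
    """Toy rule-based check: a solo 'carry' step must respect the agent's
    lift limit; heavier objects require a collaborative 'carry_together' step."""
    for action, obj in plan:
        weight = scenario.objects[obj]["weight"]
        if action == "carry" and weight > scenario.lift_limit:
            return False
    return True

scenario = Scenario(
    task="Move the toolbox to the workbench",
    objects={"toolbox": {"weight": 18.0}, "wrench": {"weight": 0.5}},
)

solo_plan = [("carry", "toolbox")]          # one agent lifts 18 kg alone
team_plan = [("carry_together", "toolbox")]  # two agents share the load
print(plan_succeeds(scenario, solo_plan))   # False: exceeds the single-agent limit
print(plan_succeeds(scenario, team_plan))   # True: collaboration satisfies the constraint
```

The point of checks like this is that success can be scored without human judgment on every run: a plan either respects the scene's physical constraints or it does not, which is what lets a benchmark scale to 1,500 scenarios while reserving humans for validating the scenarios themselves.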
Why it matters?
Embodied reasoning is essential for AI to be useful in real-world applications such as robotics, smart assistants, and automated systems that must physically interact with their environment or cooperate with people and other machines. OmniEAR pinpoints where current models fall short, particularly in collaboration and tool use, exposing architectural gaps that future AI designs will need to close to perform well in practical settings.
Abstract
OmniEAR evaluates language models' embodied reasoning capabilities in physical interactions, tool usage, and multi-agent coordination, revealing performance degradation under constraints and highlighting architectural limitations.