BEAR: Benchmarking and Enhancing Multimodal Language Models for Atomic Embodied Capabilities

Yu Qi, Haibo Zhao, Ziyu Guo, Siyuan Ma, Ziyan Chen, Yaokun Han, Renrui Zhang, Zitiantao Lin, Shiji Xin, Yijian Huang, Kai Cheng, Peiheng Wang, Jiazheng Liu, Jiayi Zhang, Yizhe Zhu, Wenqing Wang, Yiran Qin, Xupeng Zhu, Haojie Huang, Lawson L. S. Wong

2025-10-13

Summary

This paper investigates how well current AI models, specifically those that can understand both images/videos and text, can actually *do* things in the real world, meaning tasks that require perceiving their surroundings and interacting with them. The authors introduce a new, fine-grained test called BEAR to measure these abilities.

What's the problem?

Existing tests for these AI models focus too narrowly on specific skills such as planning or spatial understanding. They don't give a complete picture of how well an AI handles basic, everyday tasks grounded in the physical world, like pointing at an object or predicting where it is moving. In short, we lack a systematic way to check whether these models are truly 'embodied', that is, able to understand and interact with the world the way humans do.

What's the solution?

The researchers created BEAR, a large, fine-grained benchmark with 4,469 scenarios spanning 14 domains grouped into 6 categories, testing everything from low-level skills like pointing and trajectory understanding up to high-level planning. They evaluated 20 different AI models on BEAR and found that all of them struggled. To close the gap, they built BEAR-Agent, a system that pairs the language model with pretrained vision models to sharpen its perception, 3D spatial understanding, and planning. BEAR-Agent substantially improved performance across the benchmark, including a 9.12% absolute gain (a 17.5% relative improvement) on top of the powerful GPT-5 model.
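To make the tool-augmented agent idea concrete, here is a toy sketch of how a language model can lean on pretrained vision tools for a spatial question. Everything below is illustrative: the function names, object labels, and numbers are invented stand-ins, and the vision tools are stubbed with fixed values rather than real detector or depth models.

```python
# Toy illustration of a conversable agent that calls vision tools
# before answering. All names and values here are hypothetical, not
# taken from the BEAR paper.

def detect_objects(image):
    """Stub for a pretrained object detector: name -> bounding box."""
    return {"mug": (120, 80, 160, 130), "plate": (40, 60, 110, 120)}

def estimate_depth(image, name):
    """Stub for a monocular depth model: mean depth (meters) per object."""
    return {"mug": 0.9, "plate": 1.4}[name]

def answer_spatial_question(image, question):
    """Route a spatial question through the vision tools, then answer.

    A real agent would feed the tool outputs back to the MLLM as extra
    context; here we compose the answer directly for illustration.
    """
    boxes = detect_objects(image)
    depths = {name: estimate_depth(image, name) for name in boxes}
    closest = min(depths, key=depths.get)
    return f"The {closest} is closest to the camera."

print(answer_spatial_question(image=None, question="Which object is closest?"))
```

The design point is that perception and 3D estimates come from specialized models rather than the language model's raw pixels-to-text guess, which is the kind of gap BEAR-Agent is built to close.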

Why it matters?

This work is important because it provides a better way to evaluate AI models that are intended to operate in the real world, like robots or virtual assistants. By pinpointing where current models fail, researchers can target those weaknesses, ultimately leading to more capable and helpful AI systems. The experiments also suggest that strengthening an AI's embodied understanding of the physical world carries over to embodied tasks in simulated environments.

Abstract

Embodied capabilities refer to a suite of fundamental abilities for an agent to perceive, comprehend, and interact with the physical world. While multimodal large language models (MLLMs) show promise as embodied agents, a thorough and systematic evaluation of their embodied capabilities remains underexplored, as existing benchmarks primarily focus on specific domains such as planning or spatial understanding. To bridge this gap, we introduce BEAR, a comprehensive and fine-grained benchmark that evaluates MLLMs on atomic embodied capabilities. BEAR comprises 4,469 interleaved image-video-text entries across 14 domains in 6 categories, including tasks from low-level pointing, trajectory understanding, spatial reasoning, to high-level planning. Extensive evaluation results of 20 representative MLLMs reveal their persistent limitations across all domains of embodied capabilities. To tackle the shortfall, we propose BEAR-Agent, a multimodal conversable agent that integrates pretrained vision models to strengthen MLLM perception, 3D understanding, and planning capabilities. It substantially enhances MLLM performance across diverse embodied capabilities on BEAR, yielding a 9.12% absolute gain and a relative improvement of 17.5% on GPT-5. Furthermore, our experiments indicate that improving MLLM embodied capabilities can benefit embodied tasks in simulated environments. Project website: https://bear-official66.github.io/