
EmbodiedBench: Comprehensive Benchmarking Multi-modal Large Language Models for Vision-Driven Embodied Agents

Rui Yang, Hanyang Chen, Junyu Zhang, Mark Zhao, Cheng Qian, Kangrui Wang, Qineng Wang, Teja Venkat Koripella, Marziyeh Movahedi, Manling Li, Heng Ji, Huan Zhang, Tong Zhang

2025-02-14


Summary

This paper introduces EmbodiedBench, a new way to test how well AI systems that use both language and vision can perform tasks in virtual environments. It's like creating a big obstacle course for smart robots to see how well they can understand and act in different situations.

What's the problem?

AI systems that combine language and vision, called multi-modal large language models (MLLMs), are getting really good at understanding things, but it's hard to know how well they can actually carry out real-world tasks. Until now there wasn't a good way to test these AI 'robots' on a wide range of activities, from simple things like moving around to complex tasks like organizing a house.

What's the solution?

The researchers created EmbodiedBench, which is like a giant test with 1,128 different tasks in four virtual environments. These tasks range from high-level jobs, like figuring out how to complete a household chore, down to low-level ones that require precise navigation and manipulation, and special subsets check skills like commonsense reasoning, understanding complex instructions, spatial awareness, and long-term planning. They tested 13 different AI systems on these tasks to see how well they performed (a rough sketch of what such a test loop could look like is shown below).
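To make the setup concrete, here is a minimal sketch of the kind of evaluation loop a benchmark like this runs: the agent sees an image and an instruction, picks an action, and the environment reports whether the task succeeded. All names here (Task, VisionLanguageAgent, env_factory, env.reset, env.step) are hypothetical stand-ins for illustration, not the actual EmbodiedBench API; the real code is linked at https://embodiedbench.github.io.

```python
# Hypothetical sketch of a vision-driven agent evaluation loop.
# None of these class or function names come from the real EmbodiedBench code.

from dataclasses import dataclass
from typing import List


@dataclass
class Task:
    instruction: str      # natural-language goal, e.g. "put the mug in the sink"
    environment: str      # which of the simulated environments to load
    max_steps: int = 30   # step budget before the episode counts as a failure


class VisionLanguageAgent:
    """Placeholder wrapper around an MLLM that maps (image, instruction) -> action."""

    def act(self, image, instruction: str) -> str:
        raise NotImplementedError  # call the MLLM of your choice here


def evaluate(agent: VisionLanguageAgent, tasks: List[Task], env_factory) -> float:
    """Run every task once and return the fraction the agent completes."""
    successes = 0
    for task in tasks:
        env = env_factory(task.environment)   # hypothetical simulator constructor
        image = env.reset(task)               # first visual observation
        for _ in range(task.max_steps):
            action = agent.act(image, task.instruction)
            image, done, success = env.step(action)  # assumed step() return values
            if done:
                successes += int(success)
                break
    return successes / len(tasks)
```

The key idea this illustrates is that success is measured by whether the goal state is reached within a step budget, so the agent has to plan and act over many steps rather than just answer a single question.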

Why it matters?

This matters because as we try to make AI that can help in the real world, we need to know what these systems are good at and where they struggle. EmbodiedBench showed that current AI is pretty good at high-level tasks but still has trouble with detailed, low-level physical actions: even the best model tested, GPT-4o, scored only 28.9% on average. This information helps researchers know what to focus on to make better AI 'robots' that could one day assist in homes, hospitals, or other real-world settings.

Abstract

Leveraging Multi-modal Large Language Models (MLLMs) to create embodied agents offers a promising avenue for tackling real-world tasks. While language-centric embodied agents have garnered substantial attention, MLLM-based embodied agents remain underexplored due to the lack of comprehensive evaluation frameworks. To bridge this gap, we introduce EmbodiedBench, an extensive benchmark designed to evaluate vision-driven embodied agents. EmbodiedBench features: (1) a diverse set of 1,128 testing tasks across four environments, ranging from high-level semantic tasks (e.g., household) to low-level tasks involving atomic actions (e.g., navigation and manipulation); and (2) six meticulously curated subsets evaluating essential agent capabilities like commonsense reasoning, complex instruction understanding, spatial awareness, visual perception, and long-term planning. Through extensive experiments, we evaluated 13 leading proprietary and open-source MLLMs within EmbodiedBench. Our findings reveal that: MLLMs excel at high-level tasks but struggle with low-level manipulation, with the best model, GPT-4o, scoring only 28.9% on average. EmbodiedBench provides a multifaceted standardized evaluation platform that not only highlights existing challenges but also offers valuable insights to advance MLLM-based embodied agents. Our code is available at https://embodiedbench.github.io.