Optimus-1: Hybrid Multimodal Memory Empowered Agents Excel in Long-Horizon Tasks

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Dongmei Jiang, Liqiang Nie

2024-08-08

Summary

This paper introduces Optimus-1, a new type of AI agent designed to handle complex tasks that require long-term planning and understanding of the world around it.

What's the problem?

Many existing AI agents struggle with long-horizon tasks, which take a long time to complete and require reasoning over extended periods. This difficulty often arises because these agents lack the world knowledge and accumulated experience needed to guide them through such complex challenges.

What's the solution?

To solve this problem, the authors developed a Hybrid Multimodal Memory module for Optimus-1. This module includes two key components: a Hierarchical Directed Knowledge Graph (HDKG) that helps the agent learn and explicitly represent world knowledge, and an Abstracted Multimodal Experience Pool (AMEP) that summarizes past experiences for later reference. By combining these elements, Optimus-1 can plan and reflect on tasks more effectively, making it capable of completing long-horizon tasks in open-world environments like Minecraft. The model has shown significantly better performance than other agents on various benchmarks.
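To make the two memory components more concrete, here is a minimal, hypothetical sketch in Python. The class names (HDKG, AMEP), their methods, and the keyword-based retrieval are illustrative assumptions, not the paper's actual implementation: the knowledge graph holds directed dependencies that a planner can traverse into sub-goals, and the experience pool holds abstracted summaries of past attempts that a reflector can look up.

```python
# Illustrative sketch only: names and structure are assumptions, not the authors' code.
from dataclasses import dataclass, field


@dataclass
class HDKG:
    """Hierarchical Directed Knowledge Graph: edges point from a goal item to the
    items it requires, so a plan can be read off by traversing the graph."""
    edges: dict[str, list[str]] = field(default_factory=dict)

    def add_dependency(self, goal: str, requirements: list[str]) -> None:
        self.edges.setdefault(goal, []).extend(requirements)

    def plan(self, goal: str) -> list[str]:
        # Post-order traversal: gather prerequisites before the goal itself.
        steps: list[str] = []
        for req in self.edges.get(goal, []):
            steps.extend(self.plan(req))
        steps.append(goal)
        return steps


@dataclass
class AMEP:
    """Abstracted Multimodal Experience Pool: compact summaries of past attempts
    (task, outcome, lesson) retrieved as in-context references."""
    experiences: list[dict] = field(default_factory=list)

    def add(self, task: str, success: bool, summary: str) -> None:
        self.experiences.append({"task": task, "success": success, "summary": summary})

    def retrieve(self, task: str, k: int = 3) -> list[dict]:
        # Naive keyword match stands in for the paper's multimodal retrieval.
        hits = [e for e in self.experiences if task in e["task"]]
        return hits[:k]


# Usage: knowledge guides planning, retrieved experience guides reflection.
kg = HDKG()
kg.add_dependency("iron_pickaxe", ["iron_ingot", "stick", "crafting_table"])
kg.add_dependency("iron_ingot", ["raw_iron", "furnace"])

pool = AMEP()
pool.add("iron_pickaxe", False, "Died to lava while mining; bring a water bucket next time.")

print(kg.plan("iron_pickaxe"))        # ordered sub-goals for the planner
print(pool.retrieve("iron_pickaxe"))  # past lessons for the reflector
```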

Why it matters?

This research is important because it represents a significant step towards creating AI systems that can operate more like humans, handling complex tasks that require both knowledge and experience. By improving how agents learn and remember information, Optimus-1 could lead to advancements in fields such as robotics, gaming, and any area where long-term planning is crucial.

Abstract

Building a general-purpose agent is a long-standing vision in the field of artificial intelligence. Existing agents have made remarkable progress in many domains, yet they still struggle to complete long-horizon tasks in an open world. We attribute this to the lack of necessary world knowledge and multimodal experience that can guide agents through a variety of long-horizon tasks. In this paper, we propose a Hybrid Multimodal Memory module to address the above challenges. It 1) transforms knowledge into a Hierarchical Directed Knowledge Graph that allows agents to explicitly represent and learn world knowledge, and 2) summarises historical information into an Abstracted Multimodal Experience Pool that provides agents with rich references for in-context learning. On top of the Hybrid Multimodal Memory module, a multimodal agent, Optimus-1, is constructed with a dedicated Knowledge-guided Planner and Experience-Driven Reflector, contributing to better planning and reflection in the face of long-horizon tasks in Minecraft. Extensive experimental results show that Optimus-1 significantly outperforms all existing agents on challenging long-horizon task benchmarks, and exhibits near human-level performance on many tasks. In addition, we introduce various Multimodal Large Language Models (MLLMs) as the backbone of Optimus-1. Experimental results show that Optimus-1 exhibits strong generalization with the help of the Hybrid Multimodal Memory module, outperforming the GPT-4V baseline on many tasks.