Optimus-3: Towards Generalist Multimodal Minecraft Agents with Scalable Task Experts

Zaijing Li, Yuquan Xie, Rui Shao, Gongwei Chen, Weili Guan, Dongmei Jiang, Liqiang Nie

2025-06-15

Summary

This paper introduces Optimus-3, an AI agent designed to play Minecraft by handling several kinds of tasks within the game world: perceiving, planning, acting, and reasoning. It combines several techniques to outperform previous AI systems in Minecraft.

What's the problem?

Building an AI that performs many tasks well in an open, complex environment like Minecraft is hard for three reasons. There is not enough good training data, different tasks interfere with each other while the AI learns, and the huge visual variety of the game makes it difficult for the AI to understand what it sees and act correctly.

What's the solution?

The authors built Optimus-3 around three main ideas. First, it uses a knowledge-enhanced data generation method that draws on Minecraft knowledge to produce high-quality, diverse training data. Second, it uses a Mixture-of-Experts architecture with separate experts focused on different tasks plus a shared expert, so that tasks do not interfere with each other during learning. Third, it applies multimodal reasoning-augmented reinforcement learning, which trains the AI to reason about what it sees and make better decisions based on visual input. Together, these help the AI improve across many kinds of tasks in Minecraft.
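
To make the second idea more concrete, here is a minimal sketch of a layer that combines one shared expert with one expert per task. It is an illustration only, not the paper's implementation: the class names (`LinearExpert`, `TaskRoutedMoE`), the routing by an explicit task label, the summing of the two outputs, and the toy linear experts are all assumptions made for this example.

```python
import random


class LinearExpert:
    """Toy expert: a single linear map y = W x (hypothetical, for illustration)."""

    def __init__(self, dim, rng):
        # Small random weights; a real expert would be a trained network block.
        self.weights = [[rng.uniform(-0.1, 0.1) for _ in range(dim)]
                        for _ in range(dim)]

    def __call__(self, x):
        return [sum(w * v for w, v in zip(row, x)) for row in self.weights]


class TaskRoutedMoE:
    """One shared expert plus one dedicated expert per task.

    Assumption: the task label is known up front and selects the expert
    directly. Each task only touches its own expert, while the shared
    expert sees every task.
    """

    def __init__(self, dim, tasks, seed=0):
        rng = random.Random(seed)
        self.shared = LinearExpert(dim, rng)
        self.experts = {task: LinearExpert(dim, rng) for task in tasks}

    def forward(self, x, task):
        if task not in self.experts:
            raise KeyError(f"no expert registered for task {task!r}")
        shared_out = self.shared(x)
        task_out = self.experts[task](x)
        # Combine by element-wise sum (an assumed combination rule).
        return [s + t for s, t in zip(shared_out, task_out)]


moe = TaskRoutedMoE(dim=4, tasks=["planning", "action"], seed=0)
features = [1.0, 0.5, -0.5, 2.0]
planning_out = moe.forward(features, "planning")
action_out = moe.forward(features, "action")
```

Because each task has its own expert, updating the `"planning"` expert in such a design would leave the `"action"` expert's weights untouched, which is the intuition behind reducing interference between tasks.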

Why does it matter?

A general AI that can understand and act well in a complex, open-world game like Minecraft is a step toward AI that can handle many tasks in real life. The work helps in building smarter agents for games, robotics, and other fields where understanding diverse environments and making good decisions is important.

Abstract

Optimus-3, a multimodal large language model agent, uses knowledge-enhanced data generation, a Mixture-of-Experts architecture, and multimodal reasoning-augmented reinforcement learning to achieve superior performance across various tasks in Minecraft.