Lumine: An Open Recipe for Building Generalist Agents in 3D Open Worlds

Weihao Tan, Xiangyang Li, Yunhao Fang, Heyuan Yao, Shi Yan, Hao Luo, Tenglong Ao, Huihui Li, Hongbin Ren, Bairen Yi, Yujia Qin, Bo An, Libin Liu, Guang Shi

2025-11-13

Summary

This paper introduces Lumine, an AI agent designed to play complex 3D open-world video games for hours at a time, interacting with the game the way a human player does: by looking at the screen and using the keyboard and mouse.

What's the problem?

Creating AI that can truly master open-world video games is hard. Existing agents often struggle with long-term planning, adapting to changing situations, and understanding the game world the way a person does. They usually cannot combine visually understanding the game, deciding what to do, and actually *doing* it, all in real time and for hours on end.

What's the solution?

The researchers built Lumine, which uses a vision-language model to process what it 'sees' in the game (the raw pixels on the screen) and to decide what to do next, controlling the game through keyboard and mouse inputs. Importantly, Lumine doesn't 'think' at every step: it invokes explicit reasoning only when it needs to, which keeps it efficient. The researchers trained it in Genshin Impact, where it completes the entire five-hour Mondstadt main storyline, and then tested it on other games without any further training.

Why does it matter?

Lumine matters because it shows an AI agent completing hours-long missions in real time in a complex, open-ended 3D game. It also handles games it was never trained on (Wuthering Waves and Honkai: Star Rail) without any fine-tuning, which suggests it is learning general skills rather than memorizing a single game. That makes it a concrete step toward more versatile AI agents.

Abstract

We introduce Lumine, the first open recipe for developing generalist agents capable of completing hours-long complex missions in real time within challenging 3D open-world environments. Lumine adopts a human-like interaction paradigm that unifies perception, reasoning, and action in an end-to-end manner, powered by a vision-language model. It processes raw pixels at 5 Hz to produce precise 30 Hz keyboard-mouse actions and adaptively invokes reasoning only when necessary. Trained in Genshin Impact, Lumine successfully completes the entire five-hour Mondstadt main storyline on par with human-level efficiency and follows natural language instructions to perform a broad spectrum of tasks in both 3D open-world exploration and 2D GUI manipulation across collection, combat, puzzle-solving, and NPC interaction. In addition to its in-domain performance, Lumine demonstrates strong zero-shot cross-game generalization. Without any fine-tuning, it accomplishes 100-minute missions in Wuthering Waves and the full five-hour first chapter of Honkai: Star Rail. These promising results highlight Lumine's effectiveness across distinct worlds and interaction dynamics, marking a concrete step toward generalist agents in open-ended environments.
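
The timing described in the abstract (pixels in at 5 Hz, keyboard-mouse actions out at 30 Hz, reasoning only when needed) can be illustrated with a small control-loop sketch. It is not the authors' implementation. The only figures taken from the abstract are the two rates; the function names (`needs_reasoning`, `reason`, `act`), the string-based action format, and the use of a separate predicate to trigger reasoning are hypothetical stand-ins chosen for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

FRAME_HZ = 5    # perception rate stated in the abstract
ACTION_HZ = 30  # action rate stated in the abstract

# 30 Hz actions from 5 Hz frames means 6 action slots per observed frame.
ACTIONS_PER_FRAME = ACTION_HZ // FRAME_HZ
FRAME_PERIOD_MS = 1000 / FRAME_HZ    # 200 ms between frames
ACTION_PERIOD_MS = 1000 / ACTION_HZ  # about 33.3 ms per action slot


@dataclass
class Step:
    reasoning: Optional[str]  # None when the agent acted without reasoning
    actions: List[str]        # one entry per 1/30 s action slot


def run_episode(
    frames: List[str],
    needs_reasoning: Callable[[str, Optional[str]], bool],  # hypothetical trigger
    reason: Callable[[str], str],                           # hypothetical reasoner
    act: Callable[[str, Optional[str]], List[str]],         # hypothetical policy
) -> List[Step]:
    """Process one frame at a time, reasoning only when the trigger fires."""
    steps: List[Step] = []
    thought: Optional[str] = None  # most recent reasoning, reused across frames
    for frame in frames:
        new_thought: Optional[str] = None
        if needs_reasoning(frame, thought):
            new_thought = reason(frame)
            thought = new_thought
        actions = act(frame, thought)
        if len(actions) != ACTIONS_PER_FRAME:
            raise ValueError("policy must fill every action slot for the frame")
        steps.append(Step(new_thought, actions))
    return steps


def demo() -> List[Step]:
    """Toy run: reason once at the start, then act directly on later frames."""
    frames = ["frame0", "frame1", "frame2"]
    return run_episode(
        frames,
        needs_reasoning=lambda frame, thought: thought is None,
        reason=lambda frame: f"plan from {frame}",
        act=lambda frame, thought: ["W"] * ACTIONS_PER_FRAME,
    )


if __name__ == "__main__":
    for step in demo():
        print(step.reasoning, step.actions)
```

In this toy run, reasoning happens only on the first frame, and every frame still yields six action slots, so the action stream stays at 30 Hz regardless of how often the agent reasons.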
