MineWorld: a Real-Time and Open-Source Interactive World Model on Minecraft
Junliang Guo, Yang Ye, Tianyu He, Haoyu Wu, Yushu Jiang, Tim Pearce, Jiang Bian
2025-04-14
Summary
This paper introduces MineWorld, a real-time, open-source AI model that predicts and generates what happens next in the game Minecraft based on both what you see and what actions you take. The model uses a special kind of AI called a visual-action autoregressive Transformer, which learns from lots of Minecraft gameplay to create new game scenes that match your inputs, making it feel like a real interactive world.
What's the problem?
The problem is that most AI models for generating game worlds or videos are either too slow to work in real time or can't follow player actions closely enough, making them less useful for interactive games like Minecraft. It's also tough to measure how well these models actually respond to what players do, and existing models often struggle to generate high-quality, believable game scenes quickly.
What's the solution?
MineWorld addresses this by turning both the game's visuals and the player's actions into discrete codes, or tokens, and feeding them into a Transformer model that predicts what should happen next. The researchers developed a parallel decoding method that lets the model generate several parts of the next game scene at once, speeding things up enough to keep pace with real gameplay. They also proposed new metrics for testing how well the model follows actions, showing that MineWorld is both faster and more accurate than other open-source models.
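To give a feel for why parallel decoding helps, here is a minimal sketch of one common scheme for decoding a grid of image tokens in parallel: tokens on the same anti-diagonal of the grid can be produced in one step, since each depends only on tokens above and to the left. This is an illustrative assumption about how such a scheduler could look, not the paper's actual implementation; the function name and grid sizes are made up for the example.

```python
def diagonal_schedule(height, width):
    """Group grid positions (row, col) by anti-diagonal index row + col.

    Tokens on the same anti-diagonal are not left/up neighbors of each
    other, so a model that conditions on the tokens above and to the
    left of each position can decode a whole group in one step.
    """
    steps = []
    for d in range(height + width - 1):
        group = [(r, d - r) for r in range(height) if 0 <= d - r < width]
        steps.append(group)
    return steps

# A 3x4 token grid needs only 3 + 4 - 1 = 6 decoding steps
# instead of 12 fully sequential ones.
schedule = diagonal_schedule(3, 4)
print(len(schedule))   # 6
print(schedule[2])     # [(0, 2), (1, 1), (2, 0)]
```

The speedup grows with grid size: an H x W grid takes H + W - 1 steps rather than H x W, which is what makes real-time generation plausible.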
Why it matters?
This work matters because it sets a new standard for creating interactive, AI-driven worlds that can react to players in real time. MineWorld could help make video games, simulations, and virtual environments more immersive and responsive, while also providing a powerful tool for research and AI development in open-ended digital worlds.
Abstract
MineWorld is a real-time interactive world model built on Minecraft. It uses a visual-action autoregressive Transformer to generate subsequent game scenes from interleaved game scenes and player actions, outperforming state-of-the-art diffusion-based models.