MineWorld

A key innovation of MineWorld is its Diagonal Decoding algorithm, a parallel decoding method that lets the model generate groups of spatially related tokens simultaneously rather than one token at a time. This dramatically accelerates inference, reaching 4 to 7 frames per second, which is fast enough for interactive play, even for experienced gamers. The model's architecture delivers high generation quality and strong controllability: it faithfully follows user actions and maintains visual consistency across frames. MineWorld also introduces new evaluation metrics that assess not only the visual fidelity of generated scenes but also how accurately the model's outputs follow the intended actions, providing a benchmark for action-following capacity in world modeling.
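To make the idea of decoding spatially related tokens in parallel more concrete, here is a minimal sketch of one possible schedule. It assumes that a frame is a 2D grid of tokens and that all tokens on the same anti-diagonal (same row + column index) are generated in one step; the text above does not spell out the exact grouping rule, so the function names and the grouping itself are illustrative assumptions, not MineWorld's actual implementation.

```python
from collections import defaultdict


def diagonal_schedule(height, width):
    """Group token-grid positions by anti-diagonal index (row + col).

    Assumption for illustration: every position in one group is decoded
    in the same step, after all earlier groups are finished.
    """
    groups = defaultdict(list)
    for row in range(height):
        for col in range(width):
            groups[row + col].append((row, col))
    # One list of positions per decoding step, in step order.
    return [groups[d] for d in range(height + width - 1)]


def count_steps(height, width):
    """Return (raster-order steps, diagonal steps) for one frame."""
    return height * width, height + width - 1


schedule = diagonal_schedule(3, 4)
for step, positions in enumerate(schedule):
    print(step, positions)
print(count_steps(3, 4))
```

Under this assumed schedule, the left and upper neighbours of every token belong to the previous step, so a token can still condition on its nearest already-generated context while the number of sequential steps per frame drops from height × width to height + width − 1.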

Beyond its simulation capabilities, MineWorld can function as both a world model and a policy model: it predicts future game states as well as the actions to take, so it can serve as an autonomous game agent. This dual functionality opens up possibilities for research in reinforcement learning, agent training, and interactive storytelling within the Minecraft universe. Although it is currently limited to Minecraft data at a fixed resolution, MineWorld's open-source release includes code, model weights, and setup tools, making it accessible to researchers, developers, and hobbyists interested in virtual environments, generative modeling, and interactive simulations.
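The two roles can be illustrated with a toy rollout loop. This is only a sketch under stated assumptions: frames and actions are plain strings, and `next_frame` and `next_action` are hypothetical stand-ins for the model's predictions (in MineWorld a single model is described as covering both roles). None of these names come from the MineWorld code.

```python
from typing import Callable, List, Optional, Sequence

# Hypothetical predictor type: maps the interleaved history to the next item.
Predictor = Callable[[Sequence[str]], str]


def rollout(
    next_frame: Predictor,
    next_action: Predictor,
    first_frame: str,
    steps: int,
    user_actions: Optional[Sequence[str]] = None,
) -> List[str]:
    """Build an interleaved history: frame, action, frame, action, ..."""
    history = [first_frame]
    for step in range(steps):
        if user_actions is not None:
            # World-model mode: a player supplies the action.
            action = user_actions[step]
        else:
            # Agent mode: the model also chooses the action.
            action = next_action(history)
        history.append(action)
        # In both modes the next frame is predicted from the history.
        history.append(next_frame(history))
    return history


def toy_next_action(history: Sequence[str]) -> str:
    return "forward"  # placeholder policy, always walks forward


def toy_next_frame(history: Sequence[str]) -> str:
    return f"frame{len(history) // 2}"  # placeholder frame label


agent_run = rollout(toy_next_frame, toy_next_action, "frame0", steps=2)
player_run = rollout(
    toy_next_frame, toy_next_action, "frame0", steps=2,
    user_actions=["jump", "left"],
)
print(agent_run)
print(player_run)
```

The only difference between the two modes in this sketch is where the action comes from: supplied by a player for interactive simulation, or predicted by the model when it acts as an agent.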

Key features include:

Visual-action autoregressive Transformer for high-fidelity scene and action generation
Diagonal Decoding for real-time frame rates (4–7 fps) and efficient parallel inference
Strong controllability and action-following capacity for interactive gameplay
Dual capability as a world model and policy model for autonomous agent simulation
Open-source release with code, model weights, and easy setup for experimentation
