WildWorld: A Large-Scale Dataset for Dynamic World Modeling with Actions and Explicit State toward Generative ARPG
Zhen Li, Zian Meng, Shuwei Shi, Wenshuo Peng, Yuwei Wu, Bo Zheng, Chuanhao Li, Kaipeng Zhang
2026-03-25
Summary
This paper introduces a new dataset called WildWorld, designed to help computers learn how the world changes when actions are taken, specifically within a video game environment.
What's the problem?
Current datasets for teaching computers how the world works fall short in two ways: the available actions are limited, and those actions are tied directly to what things *look* like rather than to the underlying state that actually causes the changes. Imagine learning a game where pressing a button just changes the screen without ever revealing *why*; it is hard to infer the game's actual rules. As a result, models struggle to predict what will happen over longer time horizons and to capture the true cause and effect of actions.
What's the solution?
The researchers created WildWorld, a massive dataset collected automatically from the game Monster Hunter: Wilds. It includes over 108 million video frames and more than 450 different actions a character can take, such as moving, attacking, and casting skills. Importantly, each frame also comes with synchronized annotations of the character's skeleton pose, the state of the game world, the camera pose, and a depth map. The researchers also built WildBench, a benchmark suite that tests how well models learn from this data along two axes: Action Following and State Alignment.
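To make the idea of "synchronized per-frame annotations" concrete, here is a minimal sketch of what one such record might look like. This is a hypothetical schema: the field names, types, and example values are illustrative and are not the dataset's actual format.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class FrameAnnotation:
    """One synchronized per-frame record (hypothetical schema;
    field names are illustrative, not the dataset's actual format)."""
    frame_index: int
    action_id: int                               # one of the 450+ discrete actions
    skeleton: List[Tuple[float, float, float]]   # 3D character joint positions
    world_state: dict                            # e.g. health, stamina, target status
    camera_pose: Tuple[float, ...]               # camera extrinsics, flattened
    depth_path: str                              # path to the per-frame depth map

# Example: a minimal record for a single frame
ann = FrameAnnotation(
    frame_index=0,
    action_id=17,
    skeleton=[(0.0, 1.0, 0.0)],
    world_state={"health": 100},
    camera_pose=(1.0, 0.0, 0.0, 0.0),
    depth_path="frames/000000_depth.png",
)
print(ann.action_id)  # -> 17
```

Keeping the action, skeleton, world state, camera, and depth fields in one record per frame is what "synchronized" means here: a model can be supervised or evaluated on any of these signals at the exact frame where an action occurs.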
Why it matters?
This work is important because it provides a much better resource for training computers to understand and predict how the world changes in response to actions. By including detailed information about the underlying state of the world, it pushes researchers to develop models that can learn more meaningful relationships between actions and outcomes, ultimately leading to more intelligent and capable AI systems that can plan and act effectively in complex environments.
Abstract
Dynamical systems theory and reinforcement learning view world evolution as latent-state dynamics driven by actions, with visual observations providing partial information about the state. Recent video world models attempt to learn this action-conditioned dynamics from data. However, existing datasets rarely match the requirement: they typically lack diverse and semantically meaningful action spaces, and actions are directly tied to visual observations rather than mediated by underlying states. As a result, actions are often entangled with pixel-level changes, making it difficult for models to learn structured world dynamics and maintain consistent evolution over long horizons. In this paper, we propose WildWorld, a large-scale action-conditioned world modeling dataset with explicit state annotations, automatically collected from a photorealistic AAA action role-playing game (Monster Hunter: Wilds). WildWorld contains over 108 million frames and features more than 450 actions, including movement, attacks, and skill casting, together with synchronized per-frame annotations of character skeletons, world states, camera poses, and depth maps. We further derive WildBench to evaluate models through Action Following and State Alignment. Extensive experiments reveal persistent challenges in modeling semantically rich actions and maintaining long-horizon state consistency, highlighting the need for state-aware video generation. The project page is https://shandaai.github.io/wildworld-project/.
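The abstract's framing of world evolution as latent-state dynamics driven by actions, with video frames as partial observations, can be written in standard dynamical-systems notation (a generic sketch, not the paper's specific formalization):

```latex
s_{t+1} = f(s_t, a_t), \qquad o_t = g(s_t)
```

Here $s_t$ is the hidden world state, $a_t$ the action, and $o_t$ the visual observation at time $t$. A video world model must approximate the map from past observations and actions $(o_{\le t}, a_{\le t})$ to $o_{t+1}$ without direct access to $s_t$; the explicit state annotations in WildWorld expose $s_t$ itself, which is what enables evaluation criteria such as State Alignment.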