From Masks to Worlds: A Hitchhiker's Guide to World Models
Jinbin Bai, Yu Lei, Hecong Wu, Yuchen Zhu, Shufan Li, Yi Xin, Xiangtai Li, Molei Tao, Aditya Grover, Ming-Hsuan Yang
2025-10-24
Summary
This paper isn't just a review of existing 'world model' research; it's a practical guide for anyone wanting to *create* these models. It doesn't try to cover everything ever done in the field, but instead focuses on a specific, successful path towards building truly intelligent systems.
What's the problem?
The field of 'world models' was becoming scattered, with lots of different approaches that weren't necessarily building on each other effectively. There was a need to identify the core components and the most promising direction for future research, rather than just listing everything that had been tried.
What's the solution?
The authors trace the development of world models through four key stages: starting with masked models that learn to represent information from different sources like images and text, then moving to unified architectures that share a single paradigm, then adding the ability for the model to interact with its 'world', and finally incorporating memory to maintain a consistent understanding over time. They argue that focusing on three core elements (generation, interaction, and memory) is the key.
Why it matters?
This work is important because it provides a clear roadmap for researchers. By identifying the central ideas and the most effective progression of techniques, it helps to streamline efforts and accelerate progress towards building artificial intelligence that can understand and interact with the world in a more human-like way.
Abstract
This is not a typical survey of world models; it is a guide for those who want to build worlds. We do not aim to catalog every paper that has ever mentioned a "world model". Instead, we follow one clear road: from early masked models that unified representation learning across modalities, to unified architectures that share a single paradigm, then to interactive generative models that close the action-perception loop, and finally to memory-augmented systems that sustain consistent worlds over time. We bypass loosely related branches to focus on the core: the generative heart, the interactive loop, and the memory system. We argue that this is the most promising path towards true world models.