3D and 4D World Modeling: A Survey
Lingdong Kong, Wesley Yang, Jianbiao Mei, Youquan Liu, Ao Liang, Dekai Zhu, Dongyue Lu, Wei Yin, Xiaotao Hu, Mingkai Jia, Junyuan Deng, Kaiwen Zhang, Yang Wu, Tianyi Yan, Shenyuan Gao, Song Wang, Linfeng Li, Liang Pan, Yong Liu, Jianke Zhu, Wei Tsang Ooi, Steven C. H. Hoi
2025-09-11
Summary
This paper is a comprehensive overview of how artificial intelligence is learning to understand and recreate the 3D world around it, focusing on methods that go beyond just looking at pictures and videos.
What's the problem?
Currently, AI that tries to model the world often focuses on 2D images and videos. However, a rapidly growing body of work uses richer 3D data such as RGB-D imagery, occupancy grids, and LiDAR scans, and this progress hasn't been well documented. Also, there isn't a clear, agreed-upon definition of what a 'world model' actually *is*, leading to confusion and sometimes inconsistent claims in the research literature.
What's the solution?
The authors created a detailed survey that specifically covers AI techniques for building 3D and 4D world models. They clearly define what a world model is, categorize approaches by the type of data they use (video, occupancy grids, or LiDAR), and summarize the datasets and evaluation metrics used to measure how well these models work. They also provide a link to a resource with a systematic summary of the research they reviewed.
Why it matters?
This work is important because it provides a central resource for researchers working on AI that interacts with the physical world. By clarifying definitions, categorizing methods, and highlighting challenges, it helps to organize the field and guide future research towards building more capable and realistic AI systems. It's like creating a roadmap for building AI that can truly 'understand' its surroundings.
Abstract
World modeling has become a cornerstone in AI research, enabling agents to understand, represent, and predict the dynamic environments they inhabit. While prior work largely emphasizes generative methods for 2D image and video data, it overlooks the rapidly growing body of work that leverages native 3D and 4D representations such as RGB-D imagery, occupancy grids, and LiDAR point clouds for large-scale scene modeling. At the same time, the absence of a standardized definition and taxonomy for "world models" has led to fragmented and sometimes inconsistent claims in the literature. This survey addresses these gaps by presenting the first comprehensive review explicitly dedicated to 3D and 4D world modeling and generation. We establish precise definitions, introduce a structured taxonomy spanning video-based (VideoGen), occupancy-based (OccGen), and LiDAR-based (LiDARGen) approaches, and systematically summarize datasets and evaluation metrics tailored to 3D/4D settings. We further discuss practical applications, identify open challenges, and highlight promising research directions, aiming to provide a coherent and foundational reference for advancing the field. A systematic summary of existing literature is available at https://github.com/worldbench/survey