World Modeling with Probabilistic Structure Integration
Klemen Kotar, Wanhee Lee, Rahul Venkatesh, Honglin Chen, Daniel Bear, Jared Watrous, Simon Kim, Khai Loong Aw, Lilian Naing Chen, Stefan Stojanov, Kevin Feigelis, Imran Thobani, Alex Durango, Khaled Jedoui, Atlas Kazemian, Dan Yamins
2025-09-15
Summary
This paper introduces Probabilistic Structure Integration (PSI), a system that learns how the world works from large amounts of video data. The goal is a model that can predict what happens next in a video and can also be readily controlled and adapted to different tasks.
What's the problem?
Current AI models often struggle to truly *understand* the underlying structure of complex data like video. They can predict, but they don't necessarily grasp the important elements and relationships within the data. This makes it hard to control what the model does or to get it to perform new tasks without a lot of retraining. Essentially, they lack a flexible way to represent and manipulate the core concepts within the data.
What's the solution?
PSI works in a three-step cycle. First, it builds a detailed probabilistic model of the video data, figuring out how different parts of the video relate to each other. Second, it automatically identifies key 'intermediate structures' – think of these as fundamental properties like movement or depth – within the video. Finally, it integrates these structures back into the model as new ways to control and improve its predictions. This cycle repeats, making the model smarter and more controllable with each iteration.
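To make the second step concrete, here is a minimal sketch of one way a property such as movement could be read out of a predictive model without extra training. The paper's abstract describes the extraction step as causal inference on the model. The specific recipe below (perturb one pixel, then see where the predicted next frame changes) is an illustrative assumption rather than the authors' stated procedure, and `toy_predictor` is a hypothetical stand-in for a learned model.

```python
import numpy as np

def toy_predictor(frame, shift=(1, 2)):
    """Hypothetical stand-in for a learned next-frame predictor.

    It simply translates the frame by `shift` (rows, cols) with wrap-around.
    """
    return np.roll(frame, shift, axis=(0, 1))

def estimate_motion(predict, frame, y, x, eps=1.0):
    """Estimate where pixel (y, x) moves by perturbing it and comparing predictions."""
    perturbed = frame.copy()
    perturbed[y, x] += eps  # small intervention on the input
    # The prediction changes only where the perturbed pixel ends up.
    diff = np.abs(predict(perturbed) - predict(frame))
    ny, nx = np.unravel_index(np.argmax(diff), diff.shape)
    return int(ny - y), int(nx - x)

rng = np.random.default_rng(0)
frame = rng.random((8, 8))
motion = estimate_motion(toy_predictor, frame, y=3, x=3)
print(motion)
```

The predictor is treated as a black box here: the motion estimate comes purely from comparing its outputs with and without the intervention, which is the sense in which the structure is extracted rather than separately trained.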
Why does it matter?
This research is a step toward AI systems that learn and reason about the world in a more human-like way. By learning the underlying structure of data, PSI can predict future frames in a video, understand object movements, and improve its own performance with each cycle. This could lead to more versatile and powerful AI applications in areas such as robotics, video editing, and self-driving cars.
Abstract
We present Probabilistic Structure Integration (PSI), a system for learning richly controllable and flexibly promptable world models from data. PSI consists of a three-step cycle. The first step, Probabilistic prediction, involves building a probabilistic graphical model Psi of the data, in the form of a random-access autoregressive sequence model. Psi supports a complete set of learned conditional distributions describing the dependence of any variables in the data on any other set of variables. In step 2, Structure extraction, we show how to extract underlying low-dimensional properties in the data, corresponding to a diverse set of meaningful "intermediate structures", in a zero-shot fashion via causal inference on Psi. Step 3, Integration, completes the cycle by converting these structures into new token types that are then continually mixed back into the training diet as conditioning signals and prediction targets. Each such cycle augments the capabilities of Psi, both allowing it to model the underlying data better, and creating new control handles -- akin to an LLM-like universal prompting language. We train an instance of Psi on 1.4 trillion tokens of internet video data; we use it to perform a variety of useful video prediction and understanding inferences; we extract state-of-the-art optical flow, self-supervised depth and object segmentation; and we use these structures to support a full cycle of predictive improvements.
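As a rough illustration of what a "random-access" autoregressive sequence could look like, the sketch below serializes image patches in an arbitrary order by pairing each content token with a pointer token that records its position. This pointer-plus-content layout, along with the names `serialize_random_access` and `deserialize`, is an assumption made for illustration. The abstract does not specify the tokenization.

```python
import random

def serialize_random_access(patches, seed=None):
    """Flatten patches into (kind, value) tokens in a random order.

    Each patch contributes a pointer token (its index) followed by a
    content token, so the sequence stays decodable in any order.
    """
    order = list(range(len(patches)))
    random.Random(seed).shuffle(order)
    seq = []
    for idx in order:
        seq.append(("ptr", idx))           # which location comes next
        seq.append(("val", patches[idx]))  # content at that location
    return seq

def deserialize(seq, n):
    """Rebuild the patch list from a pointer/content token sequence."""
    out = [None] * n
    for (k1, idx), (k2, val) in zip(seq[0::2], seq[1::2]):
        assert k1 == "ptr" and k2 == "val"
        out[idx] = val
    return out

patches = [10, 11, 12, 13]  # toy stand-ins for quantized patch codes
seq = serialize_random_access(patches, seed=0)
restored = deserialize(seq, len(patches))
print(seq)
print(restored)
```

Because each content token is addressed by its own pointer, a sequence model trained on many such orderings could in principle be asked to predict any location given any other subset of locations, which is the kind of flexible conditioning the abstract attributes to Psi.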