Olaf-World: Orienting Latent Actions for Video World Modeling
Yuxin Jiang, Yuchao Gu, Ivor W. Tsang, Mike Zheng Shou
2026-02-11
Summary
This paper tackles the challenge of teaching computers to recognize actions in videos and predict their consequences, even when we never explicitly label what those actions are. It focuses on building 'world models' (essentially a computer's internal representation of how the world works) that can be controlled by actions.
What's the problem?
Currently, training these action-controllable world models requires a lot of labeled data showing exactly what action is happening at each moment in a video. A promising alternative, 'latent action learning,' tries to figure out the actions just by watching unlabeled videos, but these learned actions often don't transfer well to new situations. This is because the computer ties each action to incidental details of the video, such as the background or lighting, rather than to the action itself. It also lacks a consistent way to relate what different actions *mean* across videos.
What's the solution?
The researchers realized that even if we can't see the actions directly, we can observe their *effects* on the video. They developed a technique called SeqΔ-REPA that aligns the learned actions with the changes in the video those actions produce. It does this by comparing frames before and after an action, using a pre-trained, frozen video understanding system to measure those changes in a consistent feature space. They also built a pipeline called Olaf-World that pre-trains these action-conditioned world models on large amounts of unlabeled video.
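To make the idea concrete, here is a minimal sketch of what a sequence-level control-effect alignment loss of this kind could look like. It sums the latent actions inferred over a clip into one "integrated" action vector and pulls it, via cosine similarity, toward the difference between frozen video-encoder features of the last and first frames. This is an illustration written in PyTorch under our own assumptions; the names (`latent_actions`, `feat_first`, `feat_last`, `proj`) and the exact form of the loss are not taken from the paper.

```python
import torch.nn.functional as F


def seq_delta_alignment_loss(latent_actions, feat_first, feat_last, proj):
    """Illustrative sequence-level control-effect alignment loss (sketch).

    latent_actions: (B, T, d_a) latent actions inferred for each step of a clip
    feat_first:     (B, d_f)    frozen self-supervised encoder features, first frame
    feat_last:      (B, d_f)    frozen encoder features, last frame
    proj:           module/callable mapping action space -> encoder feature space
    """
    # "Integrate" the per-step latent actions over the whole sequence.
    integrated = latent_actions.sum(dim=1)        # (B, d_a)

    # Observable effect of the (unobserved) actions: temporal feature difference.
    effect = feat_last - feat_first               # (B, d_f)

    # Anchor the integrated latent action to the observed effect.
    pred = proj(integrated)                       # (B, d_f)
    return 1.0 - F.cosine_similarity(pred, effect, dim=-1).mean()
```

In the full pipeline, an alignment term like this would presumably be combined with the world model's usual video-prediction loss, so the latent actions stay useful for generation while being anchored to effects measured by a context-independent encoder.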
Why it matters?
This work is important because it allows computers to learn to control virtual worlds and potentially real-world robots with much less labeled data. By focusing on the *effects* of actions, the system learns more general and transferable skills, meaning it can adapt to new environments and control interfaces more easily than previous methods. This is a step towards creating more intelligent and adaptable AI systems.
Abstract
Scaling action-controllable world models is limited by the scarcity of action labels. While latent action learning promises to extract control interfaces from unlabeled video, learned latents often fail to transfer across contexts: they entangle scene-specific cues and lack a shared coordinate system. This occurs because standard objectives operate only within each clip, providing no mechanism to align action semantics across contexts. Our key insight is that although actions are unobserved, their semantic effects are observable and can serve as a shared reference. We introduce SeqΔ-REPA, a sequence-level control-effect alignment objective that anchors integrated latent action to temporal feature differences from a frozen, self-supervised video encoder. Building on this, we present Olaf-World, a pipeline that pretrains action-conditioned video world models from large-scale passive video. Extensive experiments demonstrate that our method learns a more structured latent action space, leading to stronger zero-shot action transfer and more data-efficient adaptation to new control interfaces than state-of-the-art baselines.