DINO-WM: World Models on Pre-trained Visual Features enable Zero-shot Planning
Gaoyue Zhou, Hengkai Pan, Yann LeCun, Lerrel Pinto
2025-02-03

Summary
This paper introduces a new way to make AI systems better at predicting the consequences of their actions and planning ahead. The researchers created DINO-WM, which works like a mental model for robots: it learns from pre-recorded videos of interactions and can then figure out how to do new tasks without being specifically taught each one.
What's the problem?
Current AI systems that try to understand and predict the world (called world models) are usually built for one specific task and trained online, meaning they must keep interacting with the environment while a policy is learned alongside them. This makes them inflexible: they can't easily handle new situations. It's like having a robot that's great at making coffee but has to be completely retrained before it can figure out how to make tea.
What's the solution?
The researchers developed DINO-WM, which stands for DINO World Model. Rather than predicting raw pixels, the system encodes each image into a grid of small patch features using DINOv2, a pre-trained visual encoder, and learns to predict how those features change in response to actions. Trained this way on offline, pre-collected videos of behavior, DINO-WM can plan new tasks on its own: it gives the robot a kind of imagination that lets it try out candidate action sequences in feature space and see how they would play out before executing anything (a minimal sketch of this idea follows).
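To make the mechanism concrete, here is a minimal PyTorch sketch of the idea: a frozen DINOv2 encoder turns frames into patch features, and a small transformer learns to predict the next frame's features given the current features and an action. The LatentDynamics class, its layer sizes, and the action-as-token conditioning are illustrative assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Predicts next-step patch features from current features + action.

    A minimal sketch of the DINO-WM idea: dynamics are modeled entirely
    in the frozen encoder's patch-feature space; pixels are never
    reconstructed. Sizes below match DINOv2 ViT-S/14 (384-dim features,
    256 patches for a 224x224 image) but are otherwise assumptions.
    """

    def __init__(self, feat_dim=384, action_dim=2, n_layers=4, n_heads=6):
        super().__init__()
        self.action_proj = nn.Linear(action_dim, feat_dim)
        layer = nn.TransformerEncoderLayer(
            d_model=feat_dim, nhead=n_heads, batch_first=True)
        self.transformer = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, patch_feats, action):
        # patch_feats: (B, n_patches, feat_dim); action: (B, action_dim)
        # Prepend the projected action as an extra token, then predict
        # the next frame's patch features.
        act_token = self.action_proj(action).unsqueeze(1)  # (B, 1, feat_dim)
        tokens = torch.cat([act_token, patch_feats], dim=1)
        out = self.transformer(tokens)
        return out[:, 1:]  # next-step features for the image patches

# Frozen DINOv2 encoder (real torch.hub entry point; needs a download):
# encoder = torch.hub.load('facebookresearch/dinov2', 'dinov2_vits14').eval()

# Training signal: teacher-forced next-feature prediction on offline data.
model = LatentDynamics()
feats_t = torch.randn(8, 256, 384)   # stand-in for encoder patch features
actions = torch.randn(8, 2)
feats_t1 = torch.randn(8, 256, 384)  # features of the next observation
loss = nn.functional.mse_loss(model(feats_t, actions), feats_t1)
loss.backward()
```

Because the encoder stays frozen, the only thing trained is the lightweight predictor, which is what lets the model learn from passive, pre-collected trajectories.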
Why does it matter?
This matters because it could make robots and AI systems much more adaptable and useful in the real world. Instead of needing to be programmed for every single task, they could learn to handle new situations on their own. This could lead to robots that can help in more flexible ways in homes, factories, or even in dangerous situations where humans can't go. It's a big step towards making AI that can think and plan more like humans do, which could open up many new possibilities for how we use technology in our daily lives.
Abstract
The ability to predict future outcomes given control actions is fundamental for physical reasoning. However, such predictive models, often called world models, have proven challenging to learn and are typically developed for task-specific solutions with online policy learning. We argue that the true potential of world models lies in their ability to reason and plan across diverse problems using only passive data. Concretely, we require world models to have the following three properties: 1) be trainable on offline, pre-collected trajectories, 2) support test-time behavior optimization, and 3) facilitate task-agnostic reasoning. To realize this, we present DINO World Model (DINO-WM), a new method to model visual dynamics without reconstructing the visual world. DINO-WM leverages spatial patch features pre-trained with DINOv2, enabling it to learn from offline behavioral trajectories by predicting future patch features. This design allows DINO-WM to achieve observational goals through action sequence optimization, facilitating task-agnostic behavior planning by treating desired goal patch features as prediction targets. We evaluate DINO-WM across various domains, including maze navigation, tabletop pushing, and particle manipulation. Our experiments demonstrate that DINO-WM can generate zero-shot behavioral solutions at test time without relying on expert demonstrations, reward modeling, or pre-learned inverse models. Notably, DINO-WM exhibits strong generalization capabilities compared to prior state-of-the-art work, adapting to diverse task families such as arbitrarily configured mazes, push manipulation with varied object shapes, and multi-particle scenarios.
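The abstract's key claim, planning by treating goal patch features as prediction targets, amounts to searching for an action sequence whose predicted rollout lands near the goal in feature space. The sketch below implements that search with a simple cross-entropy-method loop; the optimizer choice, the hyperparameters, and the plan_actions helper are assumptions for illustration, though the paper does describe test-time action sequence optimization toward observational goals.

```python
import torch

def plan_actions(dynamics, feats0, goal_feats, horizon=5, n_samples=256,
                 n_elites=32, n_iters=3, action_dim=2):
    """Zero-shot planning sketch: optimize an action sequence so the
    model's predicted patch features match the goal's patch features.

    `dynamics(feats, action) -> next_feats` is any one-step latent model
    (e.g. the LatentDynamics sketch above). CEM settings here are
    illustrative, not the paper's reported configuration.
    """
    mean = torch.zeros(horizon, action_dim)
    std = torch.ones(horizon, action_dim)
    for _ in range(n_iters):
        # Sample candidate action sequences: (n_samples, horizon, action_dim)
        actions = mean + std * torch.randn(n_samples, horizon, action_dim)
        feats = feats0.expand(n_samples, *feats0.shape[-2:]).clone()
        with torch.no_grad():
            for t in range(horizon):
                feats = dynamics(feats, actions[:, t])
        # Score each rollout by its distance to the goal patch features.
        costs = ((feats - goal_feats) ** 2).mean(dim=(1, 2))
        elites = actions[costs.topk(n_elites, largest=False).indices]
        mean, std = elites.mean(dim=0), elites.std(dim=0)
    return mean  # run receding-horizon: apply the first action, then replan

# Hypothetical usage with the LatentDynamics sketch above:
# best_actions = plan_actions(model, current_patch_feats, goal_patch_feats)
```

Note that nothing in this loop is task-specific: the "reward" is just feature-space distance to a goal image, which is why the same trained model can be reused across mazes, pushing, and particle tasks.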