Intuitive physics understanding emerges from self-supervised pretraining on natural videos
Quentin Garrido, Nicolas Ballas, Mahmoud Assran, Adrien Bardes, Laurent Najman, Michael Rabbat, Emmanuel Dupoux, Yann LeCun
2025-02-18
Summary
This paper shows how AI can develop an understanding of basic physics, like how objects move and interact, by watching videos and predicting what happens next. The researchers used a video prediction model called V-JEPA to study this.
What's the problem?
AI models often struggle to understand the basic rules of physics, like knowing that objects don’t just disappear or pass through each other. This limits their ability to make sense of the real world in tasks like video analysis or robotics.
What's the solution?
The researchers trained the V-JEPA model to predict missing parts of videos in a learned representation space, which encourages it to focus on the overall structure and motion of a scene rather than on pixel-level detail. Through this training, the model picked up core physics concepts like object permanence and shape consistency. To test it, they showed the model videos that obey normal physics alongside videos with impossible events, like objects vanishing, and found that V-JEPA detected these violations far more reliably than other AI models.
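The violation test described above relies on a simple idea: measure how far the observed future deviates from the model's prediction, with larger prediction errors signalling "surprise." A minimal sketch of that scoring step, using hypothetical toy embedding vectors in place of real V-JEPA representations:

```python
import numpy as np

def surprise_score(predicted, observed):
    """Mean squared prediction error in representation space.

    Under the violation-of-expectation framework, a higher score means
    the observed outcome deviates more from the model's prediction,
    i.e. the clip is more "surprising."
    """
    predicted = np.asarray(predicted, dtype=float)
    observed = np.asarray(observed, dtype=float)
    return float(np.mean((predicted - observed) ** 2))

# Hypothetical 4-dim embeddings: a physically plausible continuation
# stays close to the prediction; an "impossible" continuation (the
# object's features vanish) lands far from it.
predicted = np.array([0.9, 0.1, 0.4, 0.2])
plausible = np.array([0.85, 0.12, 0.38, 0.22])
impossible = np.array([0.0, 0.0, 0.0, 0.0])

assert surprise_score(predicted, impossible) > surprise_score(predicted, plausible)
```

A video that violates physics (e.g. an object disappearing) would then be flagged whenever its surprise score exceeds that of the physically normal alternative; the embeddings and dimensionality here are illustrative only, not the paper's actual representations.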
Why it matters?
This matters because it shows that AI can learn basic physics just by watching videos, without hand-coded rules. That could lead to AI systems with a better grasp of the physical world, useful for robotics, video analysis, or even creating more realistic animations.
Abstract
We investigate the emergence of intuitive physics understanding in general-purpose deep neural network models trained to predict masked regions in natural videos. Leveraging the violation-of-expectation framework, we find that video prediction models trained to predict outcomes in a learned representation space demonstrate an understanding of various intuitive physics properties, such as object permanence and shape consistency. In contrast, video prediction in pixel space and multimodal large language models, which reason through text, achieve performance closer to chance. Our comparisons of these architectures reveal that jointly learning an abstract representation space while predicting missing parts of sensory input, akin to predictive coding, is sufficient to acquire an understanding of intuitive physics, and that even models trained on one week of unique video achieve above chance performance. This challenges the idea that core knowledge -- a set of innate systems to help understand the world -- needs to be hardwired to develop an understanding of intuitive physics.