How Far is Video Generation from World Model: A Physical Law Perspective

Bingyi Kang, Yang Yue, Rui Lu, Zhijie Lin, Yang Zhao, Kaixin Wang, Gao Huang, Jiashi Feng

2024-11-05

Summary

This paper explores the ability of video generation models to learn and apply physical laws, particularly focusing on how well these models can predict movements and interactions of objects based solely on visual data.

What's the problem?

Current video generation models, like OpenAI's Sora, show promise in creating realistic videos but struggle to understand and apply fundamental physical laws without human guidance. This raises questions about whether these models can accurately predict how objects should behave in various scenarios, especially when faced with new or unseen situations.

What's the solution?

The researchers developed a 2D simulation environment to test the models' ability to generate videos that follow classical mechanics laws, such as uniform motion and collisions. They trained diffusion-based video generation models to predict object movements from initial frames and evaluated their performance across three scenarios: familiar situations (in-distribution), unfamiliar situations such as unseen velocities or sizes (out-of-distribution), and novel combinations of familiar concepts (combinatorial generalization). Their findings indicated that while the models performed well in familiar scenarios, they struggled significantly with unfamiliar ones, often mimicking the closest training example ("case-based" generalization) rather than abstracting broader physical principles.
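The paper's actual testbed is a 2D physics simulator (code is linked from the project page); the snippet below is only a minimal sketch of the idea under simplifying assumptions: two balls moving along a single axis with uniform motion and one perfectly elastic collision, rasterized into toy frames. The function names, parameters, and the 1D simplification are all illustrative, not the authors' implementation.

```python
import numpy as np

def simulate_two_balls(x1, v1, r1, x2, v2, r2, n_steps=32, dt=0.1):
    """Deterministic 1D toy simulation: uniform motion plus a perfectly
    elastic collision (masses taken proportional to radius squared)."""
    m1, m2 = r1 ** 2, r2 ** 2
    xs = []
    for _ in range(n_steps):
        xs.append((x1, x2))
        x1, x2 = x1 + v1 * dt, x2 + v2 * dt  # advance positions
        # collide when the balls overlap and are moving toward each other
        if abs(x1 - x2) <= r1 + r2 and (v1 - v2) * (x1 - x2) < 0:
            v1, v2 = (
                ((m1 - m2) * v1 + 2 * m2 * v2) / (m1 + m2),
                ((m2 - m1) * v2 + 2 * m1 * v1) / (m1 + m2),
            )
    return np.array(xs)  # (n_steps, 2) ground-truth positions

def rasterize(traj, r1, r2, width=128):
    """Render each step as a 1 x width grayscale strip; stacking the
    strips over time gives a toy 'video' a model could be trained on."""
    frames = np.zeros((len(traj), width), dtype=np.float32)
    cols = np.arange(width) / width  # pixel centers in [0, 1)
    for t, (x1, x2) in enumerate(traj):
        frames[t] += (np.abs(cols - x1) <= r1).astype(np.float32)
        frames[t] += (np.abs(cols - x2) <= r2).astype(np.float32)
    return np.clip(frames, 0.0, 1.0)

# Example: a fast small ball catches up with a slower large one and collides.
traj = simulate_two_balls(x1=0.1, v1=0.6, r1=0.03, x2=0.5, v2=0.1, r2=0.06)
video = rasterize(traj, r1=0.03, r2=0.06)
print(traj.shape, video.shape)  # (32, 2) (32, 128)
```

Because the simulator is deterministic, every generated video has an exact ground-truth trajectory, which is what makes large-scale, quantitative evaluation of physical-law adherence possible.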

Why it matters?

This research is important because it highlights the limitations of current AI models in understanding real-world physics. By investigating how these models learn and generalize from visual data, the study suggests that simply increasing data or model size is not enough for them to grasp fundamental physical laws. This insight is crucial for developing more advanced AI systems that can accurately simulate and predict real-world phenomena, which has implications for fields like robotics, animation, and virtual reality.

Abstract

OpenAI's Sora highlights the potential of video generation for developing world models that adhere to fundamental physical laws. However, the ability of video generation models to discover such laws purely from visual data without human priors can be questioned. A world model learning the true law should give predictions robust to nuances and correctly extrapolate on unseen scenarios. In this work, we evaluate across three key scenarios: in-distribution, out-of-distribution, and combinatorial generalization. We developed a 2D simulation testbed for object movement and collisions to generate videos deterministically governed by one or more classical mechanics laws. This provides an unlimited supply of data for large-scale experimentation and enables quantitative evaluation of whether the generated videos adhere to physical laws. We trained diffusion-based video generation models to predict object movements based on initial frames. Our scaling experiments show perfect generalization within the distribution, measurable scaling behavior for combinatorial generalization, but failure in out-of-distribution scenarios. Further experiments reveal two key insights about the generalization mechanisms of these models: (1) the models fail to abstract general physical rules and instead exhibit "case-based" generalization behavior, i.e., mimicking the closest training example; (2) when generalizing to new cases, models are observed to prioritize different factors when referencing training data: color > size > velocity > shape. Our study suggests that scaling alone is insufficient for video generation models to uncover fundamental physical laws, despite its role in Sora's broader success. See our project page at https://phyworld.github.io
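Since each video comes from a deterministic simulator, adherence to a physical law can be scored by parsing object states from the generated frames and comparing them with the simulator's ground truth. The sketch below illustrates one such metric, a mean velocity error computed by finite differences; the helper name and the toy usage are hypothetical and do not reproduce the paper's actual evaluation code.

```python
import numpy as np

def velocity_error(pred_centers, true_centers, dt=0.1):
    """Mean absolute velocity error between predicted and ground-truth
    trajectories. Inputs: arrays of shape (n_frames, n_objects) holding
    per-frame object positions, e.g. extracted from frames by a detector."""
    pred_v = np.diff(pred_centers, axis=0) / dt  # finite-difference velocities
    true_v = np.diff(true_centers, axis=0) / dt
    return float(np.mean(np.abs(pred_v - true_v)))

# Toy usage: a prediction that drifts away from uniform motion.
t = np.arange(32)[:, None] * 0.1
true_centers = 0.1 + 0.5 * t                 # uniform motion, v = 0.5
pred_centers = true_centers + 0.01 * t ** 2  # spurious acceleration
print(velocity_error(pred_centers, true_centers))
```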