Towards World Simulator: Crafting Physical Commonsense-Based Benchmark for Video Generation
Fanqing Meng, Jiaqi Liao, Xinyu Tan, Wenqi Shao, Quanfeng Lu, Kaipeng Zhang, Yu Cheng, Dianqi Li, Yu Qiao, Ping Luo
2024-10-10

Summary
This paper presents PhyGenBench, a new benchmark designed to evaluate how well text-to-video (T2V) models understand and apply physical commonsense in generating videos.
What's the problem?
While T2V models like Sora have made progress in visualizing complex prompts, they struggle to accurately represent intuitive physics—basic principles of how the physical world works. This lack of understanding limits their ability to create realistic videos that reflect real-world scenarios.
What's the solution?
To address this issue, the authors developed PhyGenBench, which includes 160 prompts based on 27 different physical laws across four main areas: mechanics, optics, thermal properties, and material properties. They also created an evaluation framework called PhyGenEval, which uses vision-language models and large language models in a hierarchical structure to assess how well T2V systems understand these physical concepts. This framework enables large-scale automated testing of the models' ability to generate videos that are physically plausible.
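To make the hierarchical idea concrete, here is a minimal, hypothetical sketch of such a pipeline: per-frame plausibility checks, a temporal-order check, and a holistic judgment are combined into one physical-commonsense score. The scoring functions below are illustrative stand-ins, not the paper's actual PhyGenEval models or weights.

```python
# Hypothetical sketch of a hierarchical video-evaluation pipeline.
# The three scorer callbacks stand in for vision-language / LLM judges;
# the weights are an assumption, not values from the paper.

from typing import Callable, List

def evaluate_video(frames: List[str],
                   frame_scorer: Callable[[str], float],
                   order_scorer: Callable[[List[str]], float],
                   overall_scorer: Callable[[List[str]], float],
                   weights=(0.3, 0.3, 0.4)) -> float:
    """Combine three hierarchical checks into one score in [0, 1]."""
    # Stage 1: does each key frame look physically plausible on its own?
    frame_score = sum(frame_scorer(f) for f in frames) / len(frames)
    # Stage 2: do the frames unfold in a physically sensible order?
    order_score = order_scorer(frames)
    # Stage 3: holistic judgment over the whole clip.
    overall_score = overall_scorer(frames)
    w1, w2, w3 = weights
    return w1 * frame_score + w2 * order_score + w3 * overall_score

# Toy usage with constant stand-in judges:
score = evaluate_video(
    frames=["frame0", "frame1", "frame2"],
    frame_scorer=lambda f: 1.0,      # every frame judged plausible
    order_scorer=lambda fs: 0.5,     # temporal order partially correct
    overall_scorer=lambda fs: 0.0,   # holistic judge rejects the clip
)
print(round(score, 2))  # 0.3*1.0 + 0.3*0.5 + 0.4*0.0 = 0.45
```

The point of the hierarchy is that a clip can pass static per-frame checks yet still fail on dynamics, which matches the paper's finding that dynamic scenarios remain the hardest cases.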
Why it matters?
This research is important because it helps improve the capabilities of AI models in generating realistic videos. By focusing on physical commonsense, PhyGenBench and PhyGenEval encourage the development of more accurate and reliable AI systems that can be used in various applications, from entertainment to education and beyond. Understanding physics is crucial for creating believable simulations of the real world.
Abstract
Text-to-video (T2V) models like Sora have made significant strides in visualizing complex prompts, which is increasingly viewed as a promising path towards constructing the universal world simulator. Cognitive psychologists believe that the foundation for achieving this goal is the ability to understand intuitive physics. However, the capacity of these models to accurately represent intuitive physics remains largely unexplored. To bridge this gap, we introduce PhyGenBench, a comprehensive Physics Generation Benchmark designed to evaluate physical commonsense correctness in T2V generation. PhyGenBench comprises 160 carefully crafted prompts across 27 distinct physical laws, spanning four fundamental domains, which comprehensively assess models' understanding of physical commonsense. Alongside PhyGenBench, we propose a novel evaluation framework called PhyGenEval. This framework employs a hierarchical evaluation structure utilizing appropriate advanced vision-language models and large language models to assess physical commonsense. Through PhyGenBench and PhyGenEval, we can conduct large-scale automated assessments of T2V models' understanding of physical commonsense, which align closely with human feedback. Our evaluation results and in-depth analysis demonstrate that current models struggle to generate videos that comply with physical commonsense. Moreover, simply scaling up models or employing prompt engineering techniques is insufficient to fully address the challenges presented by PhyGenBench (e.g., dynamic scenarios). We hope this study will inspire the community to prioritize the learning of physical commonsense in these models beyond entertainment applications. We will release the data and code at https://github.com/OpenGVLab/PhyGenBench