PhysGame: Uncovering Physical Commonsense Violations in Gameplay Videos

Meng Cao, Haoran Tang, Haoze Zhao, Hangyu Guo, Jiaheng Liu, Ge Zhang, Ruyang Liu, Qiang Sun, Ian Reid, Xiaodan Liang

2024-12-03

Summary

This paper introduces PhysGame, a benchmark designed to evaluate how well video models understand physical commonsense by analyzing glitches in gameplay videos.

What's the problem?

Many large video language models (LVLMs) struggle to accurately interpret physical events in videos, especially when these events defy common sense, like glitches in video games. There aren't enough good datasets available to test their understanding of these physical concepts, making it hard to know how well they perform in real-world situations.

What's the solution?

To address this issue, the researchers created PhysGame, a benchmark of 880 gameplay videos showcasing physics glitches across four key areas: mechanics, kinematics, optics, and material properties. They also built two training datasets, PhysInstruct (for instruction tuning) and PhysDPO (for preference optimization), to help models learn physical commonsense. Using these resources, they trained a new model, PhysVLM, that significantly improves performance on tasks requiring physical understanding.
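The abstract notes that PhysDPO's dis-preferred responses are produced by deliberately degrading the generation inputs: a misleading title (meta-information hacking), fewer sampled frames (temporal hacking), or lower spatial resolution (spatial hacking). A minimal sketch of how such degraded configurations might be assembled is below; the class, function names, and the specific downscaling ratios are illustrative assumptions, not the paper's actual implementation.

```python
from dataclasses import dataclass, replace

@dataclass(frozen=True)
class GenerationConfig:
    """Hypothetical settings used to prompt a video LLM for one response."""
    title: str        # video title passed in as meta information
    num_frames: int   # number of frames sampled from the video
    resolution: int   # spatial resolution (pixels per side)

def make_dispreferred_configs(preferred: GenerationConfig,
                              misleading_title: str) -> list[GenerationConfig]:
    """Build three degraded configs, one per hacking strategy from the paper.
    The 1/4 frame and 1/2 resolution factors are illustrative choices."""
    return [
        # meta-information hacking: swap in a misleading title
        replace(preferred, title=misleading_title),
        # temporal hacking: sample far fewer frames
        replace(preferred, num_frames=max(1, preferred.num_frames // 4)),
        # spatial hacking: halve the spatial resolution
        replace(preferred, resolution=preferred.resolution // 2),
    ]

good = GenerationConfig(title="Car clips through wall glitch",
                        num_frames=16, resolution=448)
hacked = make_dispreferred_configs(good, misleading_title="Normal driving footage")
```

Responses generated under each degraded config would then be paired with the response from the clean config to form the preference pairs used for DPO training.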

Why it matters?

This research is important because it helps improve the ability of AI models to understand and interpret physical events in videos. By focusing on common sense violations in gameplay, PhysGame provides a unique way to evaluate and enhance the performance of video models, which can lead to better applications in fields like robotics, education, and entertainment where accurate understanding of physical interactions is crucial.

Abstract

Recent advancements in video-based large language models (Video LLMs) have witnessed the emergence of diverse capabilities to reason and interpret dynamic visual content. Among them, gameplay videos stand out as a distinctive data source, often containing glitches that defy physics commonsense. This characteristic renders them an effective benchmark for assessing the under-explored capability of physical commonsense understanding in video LLMs. In this paper, we propose PhysGame as a pioneering benchmark to evaluate physical commonsense violations in gameplay videos. PhysGame comprises 880 videos associated with glitches spanning four fundamental domains (i.e., mechanics, kinematics, optics, and material properties) and across 12 distinct physical commonsense. Through extensively evaluating various state-of-the-art video LLMs, our findings reveal that the performance of current open-source video LLMs significantly lags behind that of proprietary counterparts. To bridge this gap, we curate an instruction tuning dataset PhysInstruct with 140,057 question-answering pairs to facilitate physical commonsense learning. In addition, we also propose a preference optimization dataset PhysDPO with 34,358 training pairs, where the dis-preferred responses are generated conditioned on misleading titles (i.e., meta information hacking), fewer frames (i.e., temporal hacking) and lower spatial resolutions (i.e., spatial hacking). Based on the suite of datasets, we propose PhysVLM as a physical knowledge-enhanced video LLM. Extensive experiments on both physical-oriented benchmark PhysGame and general video understanding benchmarks demonstrate the state-of-the-art performance of PhysVLM.