TRAVL: A Recipe for Making Video-Language Models Better Judges of Physics Implausibility
Saman Motamed, Minghao Chen, Luc Van Gool, Iro Laina
2025-10-09
Summary
This research addresses a common problem with AI-generated videos: they often look realistic yet contain physically impossible events, such as objects defying gravity or appearing in the wrong place. The paper explores whether AI models that understand both video and language can be taught to identify these unrealistic moments.
What's the problem?
Current video-generation AI is good at making things *look* real, but it frequently fails to create videos that *are* physically plausible. For example, you might see a video of a ball floating in mid-air. While humans easily spot these errors, there has been no reliable way to automatically measure how physically realistic a video is. Existing AI models are also poor at detecting such issues, because they struggle to understand how objects move and cause events over time.
What's the solution?
The researchers developed a method called TRAVL that improves an AI model's ability to judge physical realism. They fine-tuned the model on a balanced set of realistic and physically implausible videos, and added a trajectory-aware attention component that helps it track how objects move across frames. They also created a new benchmark of 300 videos, called ImplausiBench, specifically designed to test this physical understanding while removing any clues in the videos' descriptions that might give a model an unfair advantage.
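The paper's summary does not spell out the module, but the PyTorch sketch below illustrates one plausible form of trajectory-aware attention: each patch location attends to itself across frames, so motion over time is encoded explicitly. The class name, tensor layout, and hyperparameters are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a trajectory-aware attention block (not the paper's exact module).
# Idea: on top of per-frame features, attend along the temporal axis so each patch token
# compares itself to the "same" location in other frames, i.e. along its trajectory.

import torch
import torch.nn as nn


class TrajectoryAwareAttention(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 8, num_frames: int = 16):
        super().__init__()
        # Learned temporal position embedding so the block knows frame order.
        self.temporal_pos = nn.Parameter(torch.zeros(1, num_frames, dim))
        self.temporal_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, N, D) -- B videos, T frames, N patch tokens per frame, D channels.
        B, T, N, D = x.shape
        # Fold the spatial axis into the batch so attention runs over time only:
        # each patch location attends to its counterparts across all frames.
        h = x.permute(0, 2, 1, 3).reshape(B * N, T, D)
        h = h + self.temporal_pos[:, :T]      # inject frame order
        q = self.norm(h)                      # pre-norm before attention
        out, _ = self.temporal_attn(q, q, q)
        # Residual connection, then restore the (B, T, N, D) layout.
        return (h + out).reshape(B, N, T, D).permute(0, 2, 1, 3)


# Example: 2 videos, 16 frames, 196 patch tokens of width 768.
feats = torch.randn(2, 16, 196, 768)
print(TrajectoryAwareAttention()(feats).shape)  # torch.Size([2, 16, 196, 768])
```

Folding the spatial axis into the batch keeps the block cheap: attention cost scales with the number of frames rather than with frames times patches.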
Why it matters?
This work is important because as AI-generated videos become more common, it's crucial to ensure they are believable and don't depict impossible scenarios. Being able to automatically assess physical realism helps us build better AI models that can create more convincing and trustworthy videos, and it provides a way to measure progress in AI's understanding of the physical world.
Abstract
Despite impressive visual fidelity, modern video generative models frequently produce sequences that violate intuitive physical laws, such as objects floating, teleporting, or morphing in ways that defy causality. While humans can easily detect such implausibilities, there remains no robust method for quantitatively assessing physical realism in video. In this work, we explore whether Video-Language Models (VLMs) can be trained to serve as reliable judges of physical plausibility. We find that existing VLMs struggle to identify physics violations, exposing fundamental limitations in their temporal and causal reasoning. To address this, we introduce TRAVL, a fine-tuning recipe that combines a balanced training dataset with a trajectory-aware attention module to improve motion encoding and discrimination in VLMs. To evaluate physical reasoning more rigorously, we propose ImplausiBench, a benchmark of 300 videos (150 real, 150 generated) that removes linguistic biases and isolates visual-temporal understanding. Performance is reported both with gold-standard human judgments and with stricter LLM-as-judge metrics. Together, TRAVL and ImplausiBench offer a unified framework for probing and improving physical plausibility in multimodal models, shedding light on a challenging and underexplored aspect of visual-temporal understanding.
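For intuition, here is a hedged sketch of how a binary plausibility benchmark in the spirit of ImplausiBench could be scored under an LLM-as-judge protocol. The `query_vlm` and `query_llm_judge` functions are hypothetical placeholders, and the assumption that real videos count as plausible and generated ones as implausible is ours, not necessarily the benchmark's exact labeling.

```python
# Illustrative sketch (not the paper's evaluation code) of scoring a VLM on a
# binary physical-plausibility benchmark with an LLM acting as the judge.

from dataclasses import dataclass


@dataclass
class BenchVideo:
    path: str
    is_real: bool  # e.g. True for the 150 real videos, False for the 150 generated ones


def query_vlm(video_path: str, question: str) -> str:
    """Placeholder: return the VLM's free-form answer about the video."""
    raise NotImplementedError


def query_llm_judge(answer: str) -> bool:
    """Placeholder: an LLM judge maps the free-form answer to plausible (True) or not."""
    raise NotImplementedError


def score(benchmark: list[BenchVideo]) -> float:
    question = "Does this video contain any physically implausible event? Explain."
    correct = 0
    for video in benchmark:
        answer = query_vlm(video.path, question)
        predicted_plausible = query_llm_judge(answer)
        # Assumption for this sketch: real videos are plausible, generated ones are not.
        correct += int(predicted_plausible == video.is_real)
    return correct / len(benchmark)
```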