QuantiPhy: A Quantitative Benchmark Evaluating Physical Reasoning Abilities of Vision-Language Models
Li Puyin, Tiange Xiang, Ella Mao, Shirley Wei, Xinye Chen, Adnan Masood, Li Fei-fei, Ehsan Adeli
2025-12-24
Summary
This paper investigates whether current AI models, specifically those that combine vision and language, actually *understand* the physical world in a measurable way, or if they just sound like they do.
What's the problem?
Currently, we evaluate these AI models by asking them questions about videos, which only tells us if their answers *seem* reasonable. It doesn't tell us whether they can accurately calculate things like an object's speed, size, or acceleration. There is no good way to test whether these models can truly reason about physical properties with numbers, not just words.
What's the solution?
The researchers created a new benchmark called QuantiPhy. This benchmark includes over 3,300 videos with corresponding questions that require the AI to give a numerical answer, such as 'What is the object's velocity at this time?'. They standardized the way questions are asked and how the answers are scored to make sure it's a fair test. They then tested several state-of-the-art AI models on this benchmark.
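To make "scoring numerical accuracy" concrete, here is a minimal sketch of one plausible scoring rule: an answer counts as correct if its relative error against the ground-truth value falls within a tolerance. The function names and the 10% tolerance are illustrative assumptions, not the paper's actual metric.

```python
from typing import Sequence

def relative_error(prediction: float, ground_truth: float) -> float:
    """Relative error between a model's numeric answer and the ground truth."""
    return abs(prediction - ground_truth) / max(abs(ground_truth), 1e-9)

def score_answers(predictions: Sequence[float],
                  ground_truths: Sequence[float],
                  tolerance: float = 0.10) -> float:
    """Fraction of answers whose relative error falls within the tolerance."""
    correct = sum(
        relative_error(p, g) <= tolerance
        for p, g in zip(predictions, ground_truths)
    )
    return correct / len(predictions)

# Example: model estimates of velocity (m/s) vs. ground-truth values
print(score_answers([2.1, 4.8, 9.5], [2.0, 5.0, 12.0]))  # -> 0.666...
```

Under a rule like this, a model that gives physically plausible-sounding but numerically wrong answers scores poorly, which is exactly the gap the benchmark is designed to expose.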
Why does it matter?
This work is important because it shows that even though these AI models can often *seem* to understand physics, they frequently get the actual numbers wrong. The research also reveals that these models rely more on pre-existing knowledge than on careful analysis of the provided video and text, which hinders their ability to understand the physical world in a quantifiable way. QuantiPhy provides a tool to push AI development toward more accurate and grounded physical reasoning.
Abstract
Understanding the physical world is essential for generalist AI agents. However, it remains unclear whether state-of-the-art vision perception models (e.g., large VLMs) can reason about physical properties quantitatively. Existing evaluations are predominantly VQA-based and qualitative, offering limited insight into whether these models can infer the kinematic quantities of moving objects from video observations. To address this, we present QuantiPhy, the first benchmark designed to quantitatively measure a VLM's physical reasoning ability. Comprising more than 3.3K video-text instances with numerical ground truth, QuantiPhy evaluates a VLM's performance on estimating an object's size, velocity, and acceleration at a given timestamp, using one of these properties as an input prior. The benchmark standardizes prompts and scoring to assess numerical accuracy, enabling fair comparisons across models. Our experiments on state-of-the-art VLMs reveal a consistent gap between their qualitative plausibility and actual numerical correctness. We further provide an in-depth analysis of key factors such as background noise, counterfactual priors, and strategic prompting, and find that state-of-the-art VLMs lean heavily on pre-trained world knowledge rather than faithfully using the provided visual and textual inputs as references when reasoning about kinematic properties quantitatively. QuantiPhy offers the first rigorous, scalable testbed to move VLMs beyond mere verbal plausibility toward a numerically grounded physical understanding.