DeepPHY: Benchmarking Agentic VLMs on Physical Reasoning

Xinrun Xu, Pi Bu, Ye Wang, Börje F. Karlsson, Ziming Wang, Tengtao Song, Qi Zhu, Jun Song, Zhiming Ding, Bo Zheng

2025-08-08

Summary

This paper introduces DeepPHY, a benchmark that measures how well Vision Language Models can understand and reason about the physical world, such as how objects move and interact in space, by testing them on a series of simulated challenges.

What's the problem?

Current Vision Language Models struggle with fine-grained physical understanding and with turning visual observations into precise control actions. This limits their usefulness for real-world tasks that depend on knowing how objects behave and interact with each other.

What's the solution?

The authors built DeepPHY, a benchmark that runs these models through a series of simulated environments of increasing difficulty, evaluating how well they can reason about physical laws and translate that reasoning into appropriate control actions (a simplified sketch of such an evaluation loop appears below).
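
To make the setup concrete, here is a minimal sketch of an agentic evaluation loop in the spirit described above: the model observes the environment, chooses an action, and success rates are tallied per difficulty tier. All names here (PhysicsEnv, query_vlm, DIFFICULTY_LEVELS, the stub logic) are hypothetical illustrations, not DeepPHY's actual API.

```python
import random

# Hypothetical difficulty tiers; DeepPHY's real environments and levels
# are defined in the paper, not in this summary.
DIFFICULTY_LEVELS = ["easy", "medium", "hard"]

class PhysicsEnv:
    """Stub simulated environment standing in for a physical-reasoning task."""
    def __init__(self, difficulty):
        # Harder tiers require longer interaction before the episode ends.
        self.steps_needed = {"easy": 2, "medium": 4, "hard": 6}[difficulty]
        self.steps_taken = 0

    def observe(self):
        # A real benchmark would render an image of the scene here.
        return f"frame_{self.steps_taken}"

    def step(self, action):
        # Episode ends after steps_needed steps; success depends on the
        # action taken on the final step (a deliberately simple stand-in
        # for task-specific success criteria).
        self.steps_taken += 1
        done = self.steps_taken >= self.steps_needed
        success = done and action == "correct"
        return done, success

def query_vlm(observation):
    # Placeholder for a real VLM call that maps an observation (image)
    # to a control action.
    return random.choice(["correct", "wrong"])

def evaluate(num_episodes=100, max_steps=10):
    for level in DIFFICULTY_LEVELS:
        successes = 0
        for _ in range(num_episodes):
            env = PhysicsEnv(level)
            for _ in range(max_steps):
                action = query_vlm(env.observe())
                done, success = env.step(action)
                if done:
                    successes += success
                    break
        print(f"{level}: success rate = {successes / num_episodes:.2f}")

if __name__ == "__main__":
    evaluate()
```

The random-action agent here is just a baseline stand-in; swapping in an actual VLM for query_vlm is where the benchmark measures physical reasoning and control.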

Why it matters?

This matters because stronger physical reasoning and control make AI systems such as robots and autonomous agents smarter and more reliable at understanding and interacting with the real world.

Abstract

DeepPHY evaluates Vision Language Models' physical reasoning and control through simulated environments with varying difficulty levels.