PhysBench: Benchmarking and Enhancing Vision-Language Models for Physical World Understanding

Wei Chow, Jiageng Mao, Boyi Li, Daniel Seita, Vitor Guizilini, Yue Wang

2025-01-31

Summary

This paper introduces a new way to test and improve how well AI systems understand the physical world. The researchers created a benchmark called PhysBench to test AI models and developed a new system called PhysAgent to make these AIs better at understanding physics.

What's the problem?

AI systems that can understand images and language (called Vision-Language Models or VLMs) are getting really good at many tasks, but they're not great at understanding how the physical world works. This is a big problem because for AI to work well in the real world, like in robots, it needs to understand basic physics. Current AI models don't have this knowledge built in, which limits what they can do.

What's the solution?

The researchers created PhysBench, which is like a big test for AI with over 10,000 questions about the physical world. These questions use videos, images, and text to test the AI on things like object properties, how objects relate to each other, understanding physical scenes, and how things move. They tested 75 different AI models and found that while these AIs are smart in many ways, they struggle with physics. To fix this, they made PhysAgent, which combines the strengths of language-understanding AI with vision models that are specially trained to understand images. This new system improved physics understanding by 18.4% when paired with GPT-4o, one of the most advanced AI models.
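To make the benchmark idea concrete, here is a minimal sketch of how a multiple-choice physics benchmark like this might be scored per domain. The data schema and the model_answer_fn callback are hypothetical illustrations, not the official PhysBench data format or evaluation code.

```python
# Minimal sketch: scoring a vision-language model on multiple-choice physics
# questions grouped by domain. The schema and model_answer_fn are hypothetical,
# not the official PhysBench data format or evaluation harness.
from dataclasses import dataclass
from typing import Callable, Dict, List

@dataclass
class PhysEntry:
    domain: str             # e.g. "object_properties", "dynamics"
    media_paths: List[str]  # interleaved video/image files for this question
    question: str
    choices: List[str]
    answer: str             # ground-truth choice label, e.g. "B"

def accuracy_by_domain(
    entries: List[PhysEntry],
    model_answer_fn: Callable[[List[str], str, List[str]], str],
) -> Dict[str, float]:
    """Per-domain accuracy for a model that maps (media, question, choices) -> label."""
    totals: Dict[str, int] = {}
    correct: Dict[str, int] = {}
    for e in entries:
        pred = model_answer_fn(e.media_paths, e.question, e.choices)
        totals[e.domain] = totals.get(e.domain, 0) + 1
        if pred.strip().upper() == e.answer.strip().upper():
            correct[e.domain] = correct.get(e.domain, 0) + 1
    return {d: correct.get(d, 0) / totals[d] for d in totals}
```

A harness like this would report a separate score for each of the four domains, which is what lets the authors say where models fall short rather than just reporting one overall number.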

Why it matters?

This matters because for AI to be truly useful in the real world, it needs to understand how physical things work. Imagine a robot trying to pick up a glass of water: it needs to know how fragile the glass is, how liquid moves, and how much force to use. By making AI better at understanding physics, we can create smarter robots and AI systems that can work more safely and effectively in the real world. This research is a big step towards making AI that can really understand and interact with the physical world around us.

Abstract

Understanding the physical world is a fundamental challenge in embodied AI, critical for enabling agents to perform complex tasks and operate safely in real-world environments. While Vision-Language Models (VLMs) have shown great promise in reasoning and task planning for embodied agents, their ability to comprehend physical phenomena remains extremely limited. To close this gap, we introduce PhysBench, a comprehensive benchmark designed to evaluate VLMs' physical world understanding capability across a diverse set of tasks. PhysBench contains 10,002 entries of interleaved video-image-text data, categorized into four major domains: physical object properties, physical object relationships, physical scene understanding, and physics-based dynamics, further divided into 19 subclasses and 8 distinct capability dimensions. Our extensive experiments, conducted on 75 representative VLMs, reveal that while these models excel in common-sense reasoning, they struggle with understanding the physical world -- likely due to the absence of physical knowledge in their training data and the lack of embedded physical priors. To tackle the shortfall, we introduce PhysAgent, a novel framework that combines the generalization strengths of VLMs with the specialized expertise of vision models, significantly enhancing VLMs' physical understanding across a variety of tasks, including an 18.4% improvement on GPT-4o. Furthermore, our results demonstrate that enhancing VLMs' physical world understanding capabilities can help embodied agents such as MOKA. We believe that PhysBench and PhysAgent offer valuable insights and contribute to bridging the gap between VLMs and physical world understanding.
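As a rough illustration of the "generalist VLM plus specialist vision models" pattern the abstract describes (not the actual PhysAgent implementation), the sketch below shows how outputs from specialist perception tools might be folded into a VLM prompt. Every callable here is a hypothetical placeholder supplied by the caller.

```python
# Illustrative sketch of the general pattern only, not the PhysAgent code:
# specialist vision models extract physical cues, and a generalist VLM reasons
# over those cues together with the original question. All callables are
# hypothetical placeholders passed in by the caller.
from typing import Callable, Dict

def answer_with_physical_cues(
    image_path: str,
    question: str,
    cue_extractors: Dict[str, Callable[[str], str]],  # e.g. depth or segmentation summarizers
    query_vlm: Callable[[str, str], str],             # (image_path, prompt) -> answer text
) -> str:
    # Run each specialist tool and summarize its output as plain text.
    cue_lines = [f"- {name}: {extract(image_path)}" for name, extract in cue_extractors.items()]

    # Condition the generalist VLM on the extracted physical cues.
    prompt = (
        "Physical cues from specialist vision models:\n"
        + "\n".join(cue_lines)
        + f"\n\nQuestion: {question}"
    )
    return query_vlm(image_path, prompt)
```

The design idea is that the generalist model keeps its broad reasoning ability while the specialists supply grounded physical measurements it would otherwise have to guess at.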