
PHYBench: Holistic Evaluation of Physical Perception and Reasoning in Large Language Models

Shi Qiu, Shaoyang Guo, Zhuo-Yang Song, Yunbo Sun, Zeyu Cai, Jiashen Wei, Tianyu Luo, Yixuan Yin, Haoxu Zhang, Yi Hu, Chenyang Wang, Chencheng Tang, Haoling Chang, Qi Liu, Ziheng Zhou, Tianyu Zhang, Jingtian Zhang, Zhangyi Liu, Minghao Li, Yuku Zhang, Boxuan Jing, Xianqi Yin

2025-04-24

Summary

This paper introduces PHYBench, a new benchmark for testing how well large language models understand and solve physics problems, using a scoring system that measures how close a model's answer is to the correct one rather than just marking it right or wrong.

What's the problem?

The problem is that while large language models are getting better at answering questions, it's not clear how well they truly understand physics, especially when it comes to both perceiving a physical situation correctly and working through the equations the way a human expert would.

What's the solution?

The researchers created PHYBench, a collection of physics problems grounded in real-world scenarios, and scored the models with a metric called the Expression Edit Distance (EED) Score. Instead of marking an answer simply right or wrong, the EED Score compares the structure of the model's final symbolic expression to the reference answer, so a nearly correct derivation earns partial credit. This allows for a more detailed and fair comparison between AI models and human experts.
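
To make the idea concrete, here is a minimal Python sketch of an edit-distance-style score over symbolic expression trees. It assumes SymPy for parsing answers and the zss package for the Zhang-Shasha tree edit distance; the helper names (to_tree, eed_score) and the simple normalization are illustrative stand-ins, not the paper's exact EED formulation:

import sympy
from zss import simple_distance, Node

def to_tree(expr):
    # Convert a SymPy expression into a zss Node tree: leaves keep their
    # symbol/number text, internal nodes are labeled by the operator head.
    if not expr.args:                     # leaf: Symbol, Integer, ...
        return Node(str(expr))
    node = Node(expr.func.__name__)       # e.g. 'Add', 'Mul', 'sin'
    for arg in expr.args:
        node.addkid(to_tree(arg))
    return node

def tree_size(node):
    # Number of nodes in a zss tree, used to normalize the raw distance.
    return 1 + sum(tree_size(c) for c in Node.get_children(node))

def eed_score(pred_str, gold_str):
    # Toy score in [0, 1]: 1.0 for symbolically equivalent answers,
    # otherwise 1 minus the tree edit distance scaled by the size of
    # the reference tree (floored at 0).
    pred = sympy.sympify(pred_str)
    gold = sympy.sympify(gold_str)
    if sympy.simplify(pred - gold) == 0:  # equivalent after simplification
        return 1.0
    dist = simple_distance(to_tree(pred), to_tree(gold))
    return max(0.0, 1.0 - dist / tree_size(to_tree(gold)))

print(eed_score("m*g*sin(theta)", "m*g*sin(theta)"))  # 1.0
print(eed_score("m*g*cos(theta)", "m*g*sin(theta)"))  # near miss, partial credit
print(eed_score("m*g/2", "m*g*sin(theta)"))           # structurally off, lower score

Even in this toy version the key design point survives: an answer that differs from the reference by one small subexpression scores much higher than one with a completely different structure, which is what lets the benchmark separate near misses from total failures.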

Why it matters?

This is important because it shows where AI still struggles with real scientific reasoning, especially in a subject as challenging as physics. By highlighting these weaknesses, PHYBench helps guide future improvements so AI can become more accurate and trustworthy in scientific and educational settings.

Abstract

A new benchmark evaluates large language models on physics problems with a novel Expression Edit Distance metric, revealing gaps compared to human experts.