PhysReason: A Comprehensive Benchmark towards Physics-Based Reasoning

Xinyu Zhang, Yuxuan Dong, Yanrui Wu, Jiaxing Huang, Chengyou Jia, Basura Fernando, Mike Zheng Shou, Lingling Zhang, Jun Liu

2025-02-18

Summary

This paper introduces PhysReason, a new benchmark of 1,200 physics problems designed to test how well large language models (LLMs) can reason about physics. Unlike math or logic benchmarks, these problems require applying physics theorems and constraints over long, multi-step solutions, much as a physics student must combine conceptual understanding with careful calculation.

What's the problem?

Current evaluations of large language models focus on domains like mathematics and logical reasoning but largely overlook physics-based reasoning, which demands applying physics theorems and constraints across long solution chains. Without a dedicated benchmark, it is hard to measure where models break down on these complex, multi-step problems.

What's the solution?

The researchers built PhysReason, a benchmark of 1,200 problems split into knowledge-based questions (25%) and reasoning-based problems (75%), with the latter graded into easy, medium, and hard levels. Problems require an average of 8.1 solution steps, rising to 15.6 for hard problems. They also propose the Physics Solution Auto Scoring Framework, which combines an efficient answer-level check with a comprehensive step-level evaluation of each solution. Tests on top models such as Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high showed all of them scoring below 60% on answer-level evaluation, with accuracy falling from 75.11% on knowledge questions to 31.95% on hard problems.

Why it matters?

This matters because the step-level evaluation pinpoints four key bottlenecks in current models: physics theorem application, physics process understanding, calculation, and physics condition analysis. By exposing exactly where models fail on multi-step physical reasoning, PhysReason gives researchers a concrete target for building AI systems that can handle genuinely complex, physics-grounded problems.

Abstract

Large language models demonstrate remarkable capabilities across various domains, especially mathematical and logical reasoning. However, current evaluations overlook physics-based reasoning - a complex task requiring physics theorems and constraints. We present PhysReason, a 1,200-problem benchmark comprising knowledge-based (25%) and reasoning-based (75%) problems, where the latter are divided into three difficulty levels (easy, medium, hard). Notably, problems require an average of 8.1 solution steps, with hard problems requiring 15.6, reflecting the complexity of physics-based reasoning. We propose the Physics Solution Auto Scoring Framework, incorporating efficient answer-level and comprehensive step-level evaluations. Top-performing models like Deepseek-R1, Gemini-2.0-Flash-Thinking, and o3-mini-high achieve less than 60% on answer-level evaluation, with performance dropping from knowledge questions (75.11%) to hard problems (31.95%). Through step-level evaluation, we identified four key bottlenecks: Physics Theorem Application, Physics Process Understanding, Calculation, and Physics Condition Analysis. These findings position PhysReason as a novel and comprehensive benchmark for evaluating physics-based reasoning capabilities in large language models. Our code and data will be published at https://dxzxy12138.github.io/PhysReason.
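The answer-level evaluation described above amounts to checking each model's final answer against the ground truth and aggregating accuracy per problem category. The sketch below is a minimal illustration of that idea only: the record fields, numeric tolerance rule, and category names are assumptions for the example, not the paper's actual scoring framework.

```python
import math
from collections import defaultdict

def answer_level_accuracy(records, rel_tol=1e-2):
    """Mark a problem correct if the model's final numeric answer matches
    the ground truth within a relative tolerance, then report accuracy
    per category (hypothetical schema: 'category', 'truth', 'prediction')."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["category"]] += 1
        if math.isclose(rec["prediction"], rec["truth"], rel_tol=rel_tol):
            correct[rec["category"]] += 1
    # Accuracy = correct answers / total problems, computed per category.
    return {cat: correct[cat] / total[cat] for cat in total}

# Toy records for illustration; the values are invented, not benchmark data.
records = [
    {"category": "knowledge", "truth": 9.8, "prediction": 9.81},
    {"category": "knowledge", "truth": 3.0, "prediction": 2.5},
    {"category": "hard", "truth": 42.0, "prediction": 41.9},
]
print(answer_level_accuracy(records))  # → {'knowledge': 0.5, 'hard': 1.0}
```

A step-level evaluation would instead walk each intermediate solution step, which is what lets the authors attribute failures to specific bottlenecks such as theorem application or condition analysis.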