CMPhysBench: A Benchmark for Evaluating Large Language Models in Condensed Matter Physics
Weida Wang, Dongchen Huang, Jiatong Li, Tengchao Yang, Ziyang Zheng, Di Zhang, Dong Han, Benteng Chen, Binzhao Luo, Zhiyu Liu, Kunling Liu, Zhiyuan Gao, Shiqi Geng, Wei Ma, Jiaming Su, Xin Li, Shuchen Pu, Yuhan Shui, Qianjia Cheng, Zhihao Dou, Dongfei Cui, Changyong He
2025-08-27
Summary
This paper introduces CMPhysBench, a new benchmark for testing how well large language models (LLMs), such as advanced AI chatbots, understand and can *solve* problems in condensed matter physics, the area of physics that deals with the properties of solids and liquids.
What's the problem?
Current methods for evaluating LLMs often just check if they get the final answer right or wrong. This doesn't show if the AI actually *understands* the physics involved, or if it just guessed correctly. Existing physics benchmarks aren't challenging enough for these advanced models, and don't accurately measure their ability to perform calculations, a crucial skill in this field.
What's the solution?
The researchers created CMPhysBench, a collection of over 520 challenging, graduate-level physics problems that require step-by-step calculations. Importantly, they also developed a new scoring system called SEED (Scalable Expression Edit Distance) that doesn't just mark a problem 'right' or 'wrong'. Instead, SEED treats the AI's final expression and the ground-truth answer as trees and awards partial credit based on how close the two are, even if the answer isn't an exact match. This allows for a more nuanced evaluation of the AI's understanding.
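As a rough illustration of the idea behind SEED, the sketch below parses two answers into expression trees and turns an edit distance between them into partial credit. The helper names (expr_to_nodes, edit_distance, seed_like_score) and the use of sympy as the parser are assumptions made for this example; the authors' actual implementation is in the linked repository.

```python
# Minimal sketch (not the authors' code): partial credit for symbolic answers
# by comparing expression trees with a simple edit distance.
import sympy as sp

def expr_to_nodes(expr):
    """Flatten a sympy expression tree into a preorder list of node labels."""
    label = expr.func.__name__ if expr.args else str(expr)
    nodes = [label]
    for arg in expr.args:
        nodes.extend(expr_to_nodes(arg))
    return nodes

def edit_distance(a, b):
    """Levenshtein distance between two node-label sequences (one-row DP)."""
    dp = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, y in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (x != y))
    return dp[len(b)]

def seed_like_score(pred: str, truth: str) -> float:
    """Score in [0, 1]: 1 for an algebraically exact answer, graded partial
    credit as the predicted expression drifts from the ground truth."""
    p, t = sp.sympify(pred), sp.sympify(truth)
    if sp.simplify(p - t) == 0:  # exact match up to algebraic simplification
        return 1.0
    pn, tn = expr_to_nodes(p), expr_to_nodes(t)
    dist = edit_distance(pn, tn)
    return max(0.0, 1.0 - dist / max(len(pn), len(tn)))

# Example: a prediction missing a factor of 1/2 still earns partial credit.
print(seed_like_score("m*v**2", "m*v**2/2"))
```

In this toy version a wrong answer that shares most of its structure with the correct one (for instance, a missing prefactor) scores well above zero, which is the kind of non-binary grading the paper argues a binary right/wrong check cannot provide.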
Why it matters?
The results show that even the most powerful LLMs struggle with these physics problems: the best-performing model tested, Grok-4, reaches only an average SEED score of 36 and 28% accuracy. This highlights a significant gap in the ability of current AI to handle complex, real-world scientific tasks. Developing better AI tools for physics could accelerate research and discovery, and this benchmark provides a valuable tool for tracking progress in that area.
Abstract
We introduce CMPhysBench, a novel benchmark designed to assess the proficiency of Large Language Models (LLMs) in condensed matter physics. CMPhysBench is composed of more than 520 meticulously curated graduate-level questions covering both representative subfields and foundational theoretical frameworks of condensed matter physics, such as magnetism, superconductivity, and strongly correlated systems. To ensure a deep understanding of the problem-solving process, we focus exclusively on calculation problems, requiring LLMs to independently generate comprehensive solutions. Meanwhile, leveraging tree-based representations of expressions, we introduce the Scalable Expression Edit Distance (SEED) score, which provides fine-grained (non-binary) partial credit and yields a more accurate assessment of the similarity between prediction and ground truth. Our results show that even the best model, Grok-4, reaches only an average SEED score of 36 and 28% accuracy on CMPhysBench, underscoring a significant capability gap in this practical and frontier domain relative to traditional physics. The code and dataset are publicly available at https://github.com/CMPhysBench/CMPhysBench.