NewtonBench: Benchmarking Generalizable Scientific Law Discovery in LLM Agents

Tianshi Zheng, Kelvin Kiu-Wai Tam, Newt Hue-Nam K. Nguyen, Baixuan Xu, Zhaowei Wang, Jiayang Cheng, Hong Ting Tsang, Weiqi Wang, Jiaxin Bai, Tianqing Fang, Yangqiu Song, Ginny Y. Wong, Simon See

2025-10-10

Summary

This paper introduces a new way to test how well large language models (LLMs) can discover scientific laws, like how gravity works. It highlights problems with current testing methods and proposes a more realistic and challenging benchmark called NewtonBench.

What's the problem?

Testing LLMs for scientific discovery is tricky because it's hard to create tests that are at once scientifically meaningful, large enough to be useful, and resistant to memorization. Existing tests also treat discovery as just finding a mathematical equation that fits a fixed dataset, which isn't how real science works. Scientists *explore* systems to find the underlying rules, actively experimenting and gathering data along the way.

What's the solution?

The researchers created NewtonBench, a set of 324 different scientific problems across 12 areas of physics. These problems aren't just about finding a formula; they require the LLM to act like a scientist, designing experiments on simulated systems to uncover hidden laws. They created these problems by slightly changing well-known laws of physics, making them hard to memorize but still based on real science. They then tested how well current LLMs could solve these problems, and even looked at whether giving the LLM tools like a code interpreter helped or hurt its performance.
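To make this concrete, here is a minimal toy sketch of the idea described above: a simulated system hides a "shifted" version of a familiar law (here, gravity with an altered exponent), and an agent must probe it experimentally to recover the hidden rule. The names, values, and the shift itself are purely illustrative assumptions, not NewtonBench's actual tasks or API.

```python
import math

# Hypothetical "metaphysical shift": gravity with a hidden exponent of 2.5
# instead of the canonical inverse-square law. All names and constants here
# are illustrative, not taken from the benchmark.
HIDDEN_EXPONENT = 2.5
G = 6.674e-11

def simulated_system(m1, m2, r):
    """Oracle the agent can query: returns the force for chosen inputs."""
    return G * m1 * m2 / r ** HIDDEN_EXPONENT

def estimate_exponent(m1=1.0, m2=1.0, radii=(1.0, 2.0, 4.0, 8.0)):
    """Probe the system at several radii and fit the power law
    F = C * r^(-p) by least squares on log F versus log r."""
    xs = [math.log(r) for r in radii]
    ys = [math.log(simulated_system(m1, m2, r)) for r in radii]
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    slope = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / \
            sum((x - x_bar) ** 2 for x in xs)
    return -slope  # recovered exponent p

print(round(estimate_exponent(), 3))  # recovers 2.5 in the noise-free case
```

In the benchmark's actual setting the agent must decide which experiments to run, and the paper reports that noisy observations and more complex systems make this kind of recovery far harder than in this clean toy case.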

Why it matters?

This work is important because it shows that while LLMs have *some* ability to discover scientific laws, it's still quite limited, especially when dealing with complex systems or noisy data. Surprisingly, giving LLMs tools didn't always help – sometimes it led them to settle for easier, but less accurate, solutions. NewtonBench provides a better way to measure progress in this field and guide the development of AI that can truly contribute to scientific discovery.

Abstract

Large language models are emerging as powerful tools for scientific law discovery, a foundational challenge in AI-driven science. However, existing benchmarks for this task suffer from a fundamental methodological trilemma, forcing a trade-off between scientific relevance, scalability, and resistance to memorization. Furthermore, they oversimplify discovery as static function fitting, failing to capture the authentic scientific process of uncovering embedded laws through the interactive exploration of complex model systems. To address these critical gaps, we introduce NewtonBench, a benchmark comprising 324 scientific law discovery tasks across 12 physics domains. Our design mitigates the evaluation trilemma by using metaphysical shifts - systematic alterations of canonical laws - to generate a vast suite of problems that are scalable, scientifically relevant, and memorization-resistant. Moreover, we elevate the evaluation from static function fitting to interactive model discovery, requiring agents to experimentally probe simulated complex systems to uncover hidden principles. Our extensive experiment reveals a clear but fragile capability for discovery in frontier LLMs: this ability degrades precipitously with increasing system complexity and exhibits extreme sensitivity to observational noise. Notably, we uncover a paradoxical effect of tool assistance: providing a code interpreter can hinder more capable models by inducing a premature shift from exploration to exploitation, causing them to satisfice on suboptimal solutions. These results demonstrate that robust, generalizable discovery in complex, interactive environments remains the core challenge. By providing a scalable, robust, and scientifically authentic testbed, NewtonBench offers a crucial tool for measuring true progress and guiding the development of next-generation AI agents capable of genuine scientific discovery.