
Probing Scientific General Intelligence of LLMs with Scientist-Aligned Workflows

Wanghan Xu, Yuhao Zhou, Yifan Zhou, Qinglong Cao, Shuo Li, Jia Bu, Bo Liu, Yixin Chen, Xuming He, Xiangyu Zhao, Xiang Zhuang, Fengxiang Wang, Zhiwang Zhou, Qiantai Feng, Wenxuan Huang, Jiaqi Wei, Hao Wu, Yuejin Yang, Guangshuai Wang, Sheng Xu, Ziyan Huang, Xinyao Liu

2025-12-22


Summary

This paper focuses on the challenge of creating AI that can truly *do* science, not just process information about it. The authors define what 'Scientific General Intelligence' (SGI) would look like and then test current AI models to see how close they come to achieving it.

What's the problem?

Current AI systems, even advanced ones, struggle with the full scientific process. They can perform tasks *within* science, like analyzing data, but cannot independently come up with research questions, design experiments, and interpret results in a truly novel way. There is also no clear standard for measuring how well an AI can act as a scientist, and existing benchmarks don't fully capture the complexity of scientific work.

What's the solution?

The researchers created a definition of SGI based on how scientists actually work – thinking, forming ideas, taking action (like running experiments), and observing the results. They then built a challenging benchmark, called SGI-Bench, with over 1,000 problems inspired by big questions in science, and tested current large language models (LLMs) on its tasks, looking at research skills, idea generation, and the ability to design and reason about experiments. They also introduced a new technique, Test-Time Reinforcement Learning (TTRL), which rewards novel hypotheses at inference time to help AI come up with more original ideas during testing; a rough sketch of that idea follows.
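
The summary does not spell out how TTRL works internally, but the abstract describes it as optimizing a retrieval-augmented novelty reward at inference, without needing a reference answer. A minimal sketch of that idea is below, assuming a simple word-overlap similarity as a stand-in for whatever retrieval and similarity measure the authors actually use; all names here are illustrative, not the paper's implementation:

```python
# Toy sketch of a retrieval-augmented novelty reward used at test time.
# Hypothetical names; the paper's actual reward and optimization differ in detail.

def jaccard(a: str, b: str) -> float:
    """Word-overlap similarity; a stand-in for an embedding-based similarity."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    return len(wa & wb) / len(wa | wb) if (wa | wb) else 0.0

def novelty_reward(hypothesis: str, retrieved: list[str]) -> float:
    """Reward = 1 - similarity to the closest retrieved snippet (no gold answer needed)."""
    if not retrieved:
        return 1.0
    return 1.0 - max(jaccard(hypothesis, ref) for ref in retrieved)

def select_most_novel(candidates: list[str], retrieved: list[str]) -> str:
    """Keep the candidate hypothesis that scores highest on the novelty reward."""
    return max(candidates, key=lambda h: novelty_reward(h, retrieved))

if __name__ == "__main__":
    retrieved_snippets = [
        "sparse attention reduces memory cost in long-context models",
        "retrieval augmentation improves factual grounding of language models",
    ]
    candidate_hypotheses = [
        "sparse attention reduces memory cost for long sequences",
        "coupling retrieval with a test-time reward signal may make generated hypotheses more original",
    ]
    print(select_most_novel(candidate_hypotheses, retrieved_snippets))
```

In this toy version the reward only reranks a batch of sampled hypotheses; the paper's TTRL presumably uses such a reward to update the model or its sampling policy at inference rather than merely picking the best candidate.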

Why it matters?

This work is important because it provides a roadmap for building AI that can genuinely contribute to scientific discovery. By defining SGI and creating a rigorous benchmark, the researchers are helping to focus development efforts and measure progress. The new technique for improving idea generation could lead to AI systems that can help scientists tackle some of the biggest challenges facing humanity.

Abstract

Despite advances in scientific AI, a coherent framework for Scientific General Intelligence (SGI), the ability to autonomously conceive, investigate, and reason across scientific domains, remains lacking. We present an operational SGI definition grounded in the Practical Inquiry Model (PIM: Deliberation, Conception, Action, Perception) and operationalize it via four scientist-aligned tasks: deep research, idea generation, dry/wet experiments, and experimental reasoning. SGI-Bench comprises over 1,000 expert-curated, cross-disciplinary samples inspired by Science's 125 Big Questions, enabling systematic evaluation of state-of-the-art LLMs. Results reveal gaps: low exact match (10–20%) in deep research despite step-level alignment; ideas lacking feasibility and detail; high code executability but low execution-result accuracy in dry experiments; low sequence fidelity in wet protocols; and persistent multimodal comparative-reasoning challenges. We further introduce Test-Time Reinforcement Learning (TTRL), which optimizes retrieval-augmented novelty rewards at inference, enhancing hypothesis novelty without reference answers. Together, our PIM-grounded definition, workflow-centric benchmark, and empirical insights establish a foundation for AI systems that genuinely participate in scientific discovery.