InnoEval: On Research Idea Evaluation as a Knowledge-Grounded, Multi-Perspective Reasoning Problem
Shuofei Qiao, Yunxiang Wei, Xuehai Wang, Bin Wu, Boyang Xue, Ningyu Zhang, Hossein A. Rahmani, Yanshan Wang, Qiang Zhang, Keyan Ding, Jeff Z. Pan, Huajun Chen, Emine Yilmaz
2026-02-17
Summary
This paper introduces a new system called InnoEval designed to better evaluate scientific ideas generated by Large Language Models (LLMs). While LLMs are getting very good at *creating* ideas, our methods for judging which of those ideas are actually good and worth pursuing haven't kept pace.
What's the problem?
Currently, evaluating ideas from LLMs is tricky because existing methods often lack depth. They tend not to draw on enough background knowledge, rarely look at an idea from multiple angles, and are prone to bias when one LLM is simply asked to judge another's output. Essentially, current methods aren't as thorough or reliable as a human expert deciding whether an idea is innovative and promising.
What's the solution?
The researchers tackled this by building InnoEval, which tries to mimic how humans evaluate ideas. It works in two main ways. First, it uses a deep search engine to gather relevant information from diverse online sources and provide context for the idea. Second, it assembles a 'review board' of simulated reviewers with different areas of expertise to get multiple perspectives on the idea, then combines their evaluations into an overall judgment. They also built datasets of real peer-reviewed scientific submissions to test how well InnoEval performs.
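To make the 'review board' idea concrete, here is a minimal, hedged sketch of how simulated reviewers with different backgrounds might each score an idea against several criteria and have their scores averaged into a consensus. The personas, criteria, prompt wording, and the `ask_llm` helper are illustrative assumptions, not the paper's actual implementation.

```python
# Illustrative sketch of a multi-perspective "review board" (not the paper's code).
# `ask_llm` is a hypothetical stand-in for whatever LLM client you have available.
from statistics import mean

def ask_llm(prompt: str) -> str:
    """Hypothetical helper: send a prompt to an LLM and return its text reply."""
    raise NotImplementedError("plug in your own LLM client here")

PERSONAS = [
    "a machine learning theorist",
    "an experimental NLP researcher",
    "an applied scientist focused on real-world deployment",
]

CRITERIA = ["novelty", "feasibility", "clarity", "potential impact"]

def review_idea(idea: str, evidence: str) -> dict:
    """Collect per-criterion scores from each simulated reviewer, then average them."""
    scores = {criterion: [] for criterion in CRITERIA}
    for persona in PERSONAS:
        for criterion in CRITERIA:
            prompt = (
                f"You are {persona}. Using the background evidence below, "
                f"rate the {criterion} of the research idea on a 1-10 scale. "
                f"Reply with a single integer.\n\n"
                f"Evidence:\n{evidence}\n\nIdea:\n{idea}"
            )
            scores[criterion].append(int(ask_llm(prompt).strip()))
    # Simple consensus: average each criterion across all reviewers.
    return {criterion: mean(values) for criterion, values in scores.items()}
```

The real framework grounds each reviewer in retrieved evidence and reconciles their views more carefully; this sketch only shows the overall shape of "many perspectives, one aggregated verdict".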
Why does it matter?
This work is important because as LLMs become more involved in scientific discovery, we need reliable ways to assess the quality of the ideas they generate. InnoEval offers a significant improvement over existing methods, getting closer to human-level judgment and helping identify truly innovative ideas that deserve further investigation. This could accelerate scientific progress.
Abstract
The rapid evolution of Large Language Models has catalyzed a surge in scientific idea production, yet this leap has not been accompanied by a matching advance in idea evaluation. Scientific evaluation is, by nature, a process that demands knowledge grounding, collective deliberation, and multi-criteria decision-making. However, existing idea evaluation methods often suffer from narrow knowledge horizons, flattened evaluation dimensions, and the inherent bias of LLM-as-a-Judge. To address these issues, we regard idea evaluation as a knowledge-grounded, multi-perspective reasoning problem and introduce InnoEval, a deep innovation evaluation framework designed to emulate human-level idea assessment. We employ a heterogeneous deep knowledge search engine that retrieves and grounds dynamic evidence from diverse online sources. We further reach review consensus through an innovation review board composed of reviewers with distinct academic backgrounds, enabling a decoupled, multi-dimensional evaluation across multiple metrics. We construct comprehensive datasets derived from authoritative peer-reviewed submissions to benchmark InnoEval. Experiments demonstrate that InnoEval consistently outperforms baselines on point-wise, pair-wise, and group-wise evaluation tasks, exhibiting judgment patterns and consensus highly aligned with those of human experts.
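The abstract mentions three evaluation settings: point-wise, pair-wise, and group-wise. The sketch below illustrates what these task formats typically look like when driven by any idea scorer (for instance, the review-board sketch above). The function names and the `Scorer` signature are assumptions for illustration, not the paper's API.

```python
# Illustrative shapes of the three evaluation tasks: point-wise, pair-wise, group-wise.
# `score_idea` is a placeholder for any evaluator that maps an idea to a quality score.
from typing import Callable, List

Scorer = Callable[[str], float]

def pointwise(idea: str, score_idea: Scorer) -> float:
    """Assign an absolute quality score to a single idea."""
    return score_idea(idea)

def pairwise(idea_a: str, idea_b: str, score_idea: Scorer) -> str:
    """Decide which of two competing ideas is stronger."""
    return "A" if score_idea(idea_a) >= score_idea(idea_b) else "B"

def groupwise(ideas: List[str], score_idea: Scorer) -> List[str]:
    """Rank a batch of ideas from strongest to weakest."""
    return sorted(ideas, key=score_idea, reverse=True)
```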