InnoGym: Benchmarking the Innovation Potential of AI Agents
Jintian Zhang, Kewei Xu, Jingsheng Zheng, Zhuoyun Yu, Yuqi Zhu, Yujie Luo, Lanning Wei, Shuofei Qiao, Lun Du, Da Zheng, Shumin Deng, Huajun Chen, Ningyu Zhang
2025-12-03
Summary
This paper introduces InnoGym, a new benchmark for AI agents that focuses on how *creative* their solutions are, not just whether they get the right answer.
What's the problem?
Current benchmarks for AI, especially in areas like coding and science, mostly check whether the AI's answer is correct. They don't look at *how* the AI solved the problem. Real progress in AI isn't just about finding an answer; it's about finding *new* and better ways to solve problems, and existing benchmarks don't measure that originality.
What's the solution?
The researchers created InnoGym, which includes 18 challenging tasks from real-world fields like engineering and science. They developed two ways to score AI agents: 'performance gain,' which measures how much better the AI does compared to existing solutions, and 'novelty,' which measures how different the AI's approach is from what's already been done. They also built a standardized testing environment called iGym to make sure the tests are fair and repeatable.
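The two metrics can be sketched in code. The summary does not give the paper's exact formulas, so the relative-improvement formula and the similarity-based novelty score below are illustrative assumptions, not InnoGym's actual definitions:

```python
def performance_gain(agent_score: float, best_known: float) -> float:
    """Illustrative: relative improvement over the best-known solution.

    Assumes higher scores are better; positive values mean the agent
    beat the prior state of the art.
    """
    return (agent_score - best_known) / abs(best_known)


def novelty(agent_approach: str, prior_approaches: list[str]) -> float:
    """Illustrative: 1 minus the max similarity to any prior approach.

    Uses a simple Jaccard word-overlap similarity as a stand-in for
    whatever methodological comparison the benchmark actually performs.
    """
    if not prior_approaches:
        return 1.0

    def jaccard(a: str, b: str) -> float:
        wa, wb = set(a.lower().split()), set(b.lower().split())
        return len(wa & wb) / len(wa | wb)

    return 1.0 - max(jaccard(agent_approach, p) for p in prior_approaches)
```

For example, an agent scoring 110 against a best-known 100 would have a gain of 0.1, while an approach described as "gradient descent with momentum" measured against prior approaches "gradient descent" and "random search" would score a novelty of 0.5 under this toy similarity.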
Why does it matter?
The results showed that while some AI agents can come up with new ideas, they aren't always reliable or effective. This highlights that being creative isn't enough: an agent needs to be both innovative *and* consistently good at solving problems. The work underscores the need for testing methods that evaluate both aspects of intelligence.
Abstract
LLMs and Agents have achieved impressive progress in code generation, mathematical reasoning, and scientific discovery. However, existing benchmarks primarily measure correctness, overlooking the diversity of methods behind solutions. True innovation depends not only on producing correct answers but also on the originality of the approach. We present InnoGym, the first benchmark and framework designed to systematically evaluate the innovation potential of AI agents. InnoGym introduces two complementary metrics: performance gain, which measures improvement over the best-known solutions, and novelty, which captures methodological differences from prior approaches. The benchmark includes 18 carefully curated tasks from real-world engineering and scientific domains, each standardized through resource filtering, evaluator validation, and solution collection. In addition, we provide iGym, a unified execution environment for reproducible and long-horizon evaluations. Extensive experiments show that while some agents produce novel approaches, their lack of robustness limits performance gains. These results highlight a key gap between creativity and effectiveness, underscoring the need for benchmarks that evaluate both.