
InductionBench: LLMs Fail in the Simplest Complexity Class

Wenyue Hua, Tyler Wong, Sun Fei, Liangming Pan, Adam Jardine, William Yang Wang

2025-02-25


Summary

This paper introduces InductionBench, a new test designed to check how well large language models (LLMs) can use inductive reasoning: the ability to figure out general rules from specific examples.

What's the problem?

While LLMs have gotten really good at deductive reasoning, where they use known rules to solve problems, they haven't been tested much on inductive reasoning. Inductive reasoning is essential for scientific discovery, but we don't know whether AI can do it well.

What's the solution?

The researchers created InductionBench, a set of problems where a model sees example inputs and outputs and has to figure out the hidden rule that connects them. They then ran this test on some of the most advanced AI models available to see how well they could handle these kinds of problems.

Why it matters?

This matters because it shows that even the smartest AI we have today struggles with a basic type of thinking that humans use all the time to learn and make discoveries. It points out a big weakness in current AI systems that needs to be fixed if we want them to think more like humans and contribute to scientific breakthroughs. It could also help guide future AI research toward smarter, more flexible systems that learn and reason more like people do.

Abstract

Large language models (LLMs) have shown remarkable improvements in reasoning and many existing benchmarks have been addressed by models such as o1 and o3 either fully or partially. However, a majority of these benchmarks emphasize deductive reasoning, including mathematical and coding tasks in which rules such as mathematical axioms or programming syntax are clearly defined, based on which LLMs can plan and apply these rules to arrive at a solution. In contrast, inductive reasoning, where one infers the underlying rules from observed data, remains less explored. Such inductive processes lie at the heart of scientific discovery, as they enable researchers to extract general principles from empirical observations. To assess whether LLMs possess this capacity, we introduce InductionBench, a new benchmark designed to evaluate the inductive reasoning ability of LLMs. Our experimental findings reveal that even the most advanced models available struggle to master the simplest complexity classes within the subregular hierarchy of functions, highlighting a notable deficiency in current LLMs' inductive reasoning capabilities. Code and data are available at https://github.com/Wenyueh/inductive_reasoning_benchmark.
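To make the task concrete, here is a minimal Python sketch of the kind of problem the abstract describes: a hidden string-to-string rule generates input-output pairs, and the job is to infer a rule that is consistent with all of them. The specific rewrite rule and example strings below are hypothetical illustrations, not taken from the paper's dataset or API; InductionBench draws its actual functions from the subregular hierarchy.

```python
# Hypothetical illustration of an inductive-reasoning task in the style the paper
# describes: infer a hidden string-to-string rule from observed input-output pairs.

def hidden_rule(s: str) -> str:
    """Ground-truth rule (never shown to the model): rewrite 'a' as 'b' right after a 'b'."""
    out = []
    for i, ch in enumerate(s):
        if ch == "a" and i > 0 and s[i - 1] == "b":
            out.append("b")
        else:
            out.append(ch)
    return "".join(out)

# Observed data: the only information the solver gets.
inputs = ["aba", "baab", "abba", "aaa"]
observations = [(x, hidden_rule(x)) for x in inputs]

def consistent(hypothesis, data) -> bool:
    """Check whether a candidate rule reproduces every observed pair."""
    return all(hypothesis(x) == y for x, y in data)

# A wrong guess ("delete every 'a'") is ruled out by the data; the right guess fits.
wrong_guess = lambda s: s.replace("a", "")
print(consistent(wrong_guess, observations))  # False
print(consistent(hidden_rule, observations))  # True
```

Even though rules like this sit in a very low complexity class, the paper's experiments find that current LLMs struggle to recover them reliably from examples alone.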