Can LLMs Generate Novel Research Ideas? A Large-Scale Human Study with 100+ NLP Researchers

Chenglei Si, Diyi Yang, Tatsunori Hashimoto

2024-09-13

Summary

This paper investigates whether large language models (LLMs) can generate novel, expert-level research ideas by comparing LLM-generated ideas against those written by expert human researchers in a study with over 100 participants.

What's the problem?

Despite rapid progress in LLMs, there has been no solid evidence that these models can generate truly novel, expert-level research ideas. Existing evaluations have not directly compared LLMs against expert human researchers under controlled conditions.

What's the solution?

The researchers designed a controlled study in which over 100 NLP experts wrote novel research ideas, which were then compared against ideas produced by an LLM ideation agent. Expert judges blindly reviewed both sets of ideas, rating their novelty and feasibility. LLM-generated ideas were judged significantly more novel (p < 0.05) than the human experts' ideas, though slightly weaker on feasibility.
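To illustrate the kind of comparison behind a claim like "more novel (p < 0.05)", here is a minimal sketch of a two-sided permutation test on reviewer ratings. The scores below are entirely hypothetical placeholders, not the paper's data, and the paper's actual statistical methodology may differ; this only shows how a significance level for a difference in mean ratings can be estimated.

```python
import random
import statistics

def permutation_test(a, b, n_permutations=10_000, seed=0):
    """Two-sided permutation test for a difference in mean ratings.

    Repeatedly reshuffles the pooled ratings into two groups and counts
    how often the permuted mean difference is at least as extreme as the
    observed one. The returned fraction estimates the p-value.
    """
    rng = random.Random(seed)
    observed = statistics.mean(a) - statistics.mean(b)
    combined = list(a) + list(b)
    n_a = len(a)
    extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(combined)
        perm_diff = statistics.mean(combined[:n_a]) - statistics.mean(combined[n_a:])
        if abs(perm_diff) >= abs(observed):
            extreme += 1
    return extreme / n_permutations

# Hypothetical 1-10 novelty ratings (illustrative only, not the study's data).
llm_scores = [6, 7, 8, 6, 7, 7, 8, 6, 7, 8]
human_scores = [5, 6, 5, 6, 5, 6, 7, 5, 6, 5]

p = permutation_test(llm_scores, human_scores)
print(f"estimated p-value: {p:.4f}")
```

A permutation test makes no normality assumption about the ratings, which is convenient for small samples of ordinal review scores; blind review, as used in the study, is what licenses pooling the two groups under the null hypothesis.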

Why it matters?

This research is significant because it suggests that AI can contribute valuable new ideas in scientific fields, potentially speeding up the process of discovery. However, it also highlights the need for human input to evaluate and refine these ideas, indicating a collaborative future between AI and human researchers.

Abstract

Recent advancements in large language models (LLMs) have sparked optimism about their potential to accelerate scientific discovery, with a growing number of works proposing research agents that autonomously generate and validate new ideas. Despite this, no evaluations have shown that LLM systems can take the very first step of producing novel, expert-level ideas, let alone perform the entire research process. We address this by establishing an experimental design that evaluates research idea generation while controlling for confounders and performs the first head-to-head comparison between expert NLP researchers and an LLM ideation agent. By recruiting over 100 NLP researchers to write novel ideas and blind reviews of both LLM and human ideas, we obtain the first statistically significant conclusion on current LLM capabilities for research ideation: we find LLM-generated ideas are judged as more novel (p < 0.05) than human expert ideas while being judged slightly weaker on feasibility. Studying our agent baselines closely, we identify open problems in building and evaluating research agents, including failures of LLM self-evaluation and their lack of diversity in generation. Finally, we acknowledge that human judgements of novelty can be difficult, even by experts, and propose an end-to-end study design which recruits researchers to execute these ideas into full projects, enabling us to study whether these novelty and feasibility judgements result in meaningful differences in research outcome.