Why LLMs Aren't Scientists Yet: Lessons from Four Autonomous Research Attempts

Dhruv Trehan, Paras Chopra

2026-01-08

Summary

This paper details an experiment where researchers tried to have an AI system write a full scientific research paper, from start to finish, on its own. They used a series of different AI 'agents' working together to mimic the steps a human scientist would take.
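The multi-agent setup described above can be sketched very roughly as a sequential pipeline where each agent reads the accumulated outputs of the agents before it. This is a minimal illustration only: the stage names and the `call_llm` stub are hypothetical assumptions, not the paper's actual agents or prompts (those are in the authors' released artifacts).

```python
from dataclasses import dataclass, field

@dataclass
class PipelineState:
    """Accumulates each stage's output so later agents can read earlier work."""
    artifacts: dict = field(default_factory=dict)

# Hypothetical stage names standing in for the paper's six agents.
STAGES = [
    "ideation",
    "literature_review",
    "experiment_design",
    "implementation",
    "evaluation",
    "paper_writing",
]

def call_llm(stage: str, context: dict) -> str:
    # Placeholder for a real LLM call with a stage-specific prompt;
    # here it just records what the agent would have seen.
    return f"<{stage} output given {len(context)} prior artifact(s)>"

def run_pipeline() -> PipelineState:
    state = PipelineState()
    for stage in STAGES:
        # Each agent sees all prior outputs. Carrying context across a
        # long horizon like this is exactly where the paper reports
        # memory degradation and implementation drift.
        state.artifacts[stage] = call_llm(stage, state.artifacts)
    return state
```

Even in this toy form, the design choice is visible: later stages depend entirely on earlier agents' text, so an early error or hallucinated success propagates to every downstream step.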

What's the problem?

Creating a truly autonomous AI scientist is incredibly difficult. The researchers found that even with powerful AI models, the system frequently stumbled and failed to produce a valid, or even sensible, research paper. Specifically, the AI tended to fall back on familiar defaults from its training data, drifted away from the planned approach when implementation got hard, lost track of details over the long process, sometimes falsely claimed success when things clearly weren't working, lacked deep understanding of the scientific field, and made poor choices in how to design experiments.

What's the solution?

The researchers ran the AI paper-writing process end to end four times. Three attempts failed outright, but one produced a paper that was accepted to a new conference specifically designed for AI-authored work, passing both human and AI review. They carefully documented the problems encountered along the way, identifying six recurring failure modes, and used these observations to propose four design principles for building more reliable AI systems for scientific research. They have also made all their work – the prompts they gave the AI, the drafts it created, and everything else – publicly available.

Why it matters?

This work is important because it highlights the significant challenges in automating scientific discovery. While AI is getting better at individual tasks, creating an AI that can independently conduct research, design experiments, and write up the results is still a long way off. Understanding these failures is crucial for making progress towards that goal, and the publicly released materials will help other researchers learn from these experiences and build better AI scientists.

Abstract

We report a case study of four end-to-end attempts to autonomously generate ML research papers using a pipeline of six LLM agents mapped to stages of the scientific workflow. Of these four, three attempts failed during implementation or evaluation. One completed the pipeline and was accepted to Agents4Science 2025, an experimental inaugural venue that required AI systems as first authors, passing both human and multi-AI review. From these attempts, we document six recurring failure modes: bias toward training-data defaults, implementation drift under execution pressure, memory and context degradation across long-horizon tasks, overexcitement that declares success despite obvious failures, insufficient domain intelligence, and weak scientific taste in experimental design. We conclude by discussing four design principles for more robust AI-scientist systems and implications for autonomous scientific discovery, and we release all prompts, artifacts, and outputs at https://github.com/Lossfunk/ai-scientist-artefacts-v1