LAB-Bench: Measuring Capabilities of Language Models for Biology Research

Jon M. Laurent, Joseph D. Janizek, Michael Ruzo, Michaela M. Hinks, Michael J. Hammerling, Siddharth Narayanan, Manvitha Ponnapati, Andrew D. White, Samuel G. Rodriques

2024-07-16

Summary

This paper introduces LAB-Bench, a new benchmark designed to evaluate how well large language models (LLMs) can perform practical tasks in biology research.

What's the problem?

While many benchmarks exist to test LLMs on textbook-style science questions, few assess their ability to handle real-world tasks that biologists actually perform, such as searching the literature, planning experimental protocols, and analyzing data. This gap makes it hard to know how effective these models are at supporting scientific research.

What's the solution?

To fill this gap, the authors created LAB-Bench, a benchmark of over 2,400 multiple-choice questions covering practical biology tasks. The questions test a model's ability to recall and reason over scientific literature, interpret figures and data, access and navigate databases, and comprehend and manipulate DNA and protein sequences. The authors evaluated several LLMs on the benchmark and compared their scores to those of human expert biology researchers, finding that while some models did well in certain areas, they struggled with the more difficult tasks that require deeper understanding.
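Because the benchmark uses a multiple-choice format, models can be scored automatically. The following is a minimal, hedged sketch of how such scoring might work; the field names ("question", "ideal", "distractors") and the `ask_model` helper are assumptions for illustration, not the authors' actual evaluation harness.

```python
# Hedged sketch of scoring an LLM on LAB-Bench-style multiple-choice questions.
# Field names and the ask_model callable are illustrative assumptions.
import random

def score_multiple_choice(questions, ask_model):
    """Return the fraction of questions answered correctly.

    questions: list of dicts with assumed keys "question", "ideal" (the
        correct answer), and "distractors" (incorrect options).
    ask_model: callable taking a prompt string and returning the model's
        reply as a string (expected to start with an option letter).
    """
    correct = 0
    for q in questions:
        options = [q["ideal"]] + list(q["distractors"])
        random.shuffle(options)  # shuffle so the correct answer has no fixed position
        labels = [chr(ord("A") + i) for i in range(len(options))]
        prompt = (
            q["question"]
            + "\n"
            + "\n".join(f"{label}. {opt}" for label, opt in zip(labels, options))
            + "\nAnswer with the letter of the best option."
        )
        choice = ask_model(prompt).strip().upper()[:1]
        if choice in labels and options[labels.index(choice)] == q["ideal"]:
            correct += 1
    return correct / len(questions)
```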

Why it matters?

This research is important because it helps identify how well AI models can assist scientists in their work. By establishing a benchmark like LAB-Bench, researchers can better understand the strengths and weaknesses of LLMs in biology, leading to improvements in how these models are developed and used in scientific research. This could ultimately accelerate discoveries in biology and other related fields.

Abstract

There is widespread optimism that frontier Large Language Models (LLMs) and LLM-augmented systems have the potential to rapidly accelerate scientific discovery across disciplines. Today, many benchmarks exist to measure LLM knowledge and reasoning on textbook-style science questions, but few if any benchmarks are designed to evaluate language model performance on practical tasks required for scientific research, such as literature search, protocol planning, and data analysis. As a step toward building such benchmarks, we introduce the Language Agent Biology Benchmark (LAB-Bench), a broad dataset of over 2,400 multiple choice questions for evaluating AI systems on a range of practical biology research capabilities, including recall and reasoning over literature, interpretation of figures, access and navigation of databases, and comprehension and manipulation of DNA and protein sequences. Importantly, in contrast to previous scientific benchmarks, we expect that an AI system that can achieve consistently high scores on the more difficult LAB-Bench tasks would serve as a useful assistant for researchers in areas such as literature search and molecular cloning. As an initial assessment of the emergent scientific task capabilities of frontier language models, we measure performance of several against our benchmark and report results compared to human expert biology researchers. We will continue to update and expand LAB-Bench over time, and expect it to serve as a useful tool in the development of automated research systems going forward. A public subset of LAB-Bench is available for use at the following URL: https://huggingface.co/datasets/futurehouse/lab-bench
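Since the abstract points to a public subset on Hugging Face, one way to load and inspect it is with the `datasets` library, as sketched below. The configuration name "LitQA2" and the "train" split are assumptions; check the dataset card at the URL above for the actual configurations and splits.

```python
# Minimal sketch for loading the public LAB-Bench subset from Hugging Face.
# The configuration name "LitQA2" and the "train" split are assumptions.
from datasets import load_dataset

dataset = load_dataset("futurehouse/lab-bench", "LitQA2")
print(dataset)              # show available splits and sizes
print(dataset["train"][0])  # inspect the fields of one question
```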