
Terminal-Bench: Benchmarking Agents on Hard, Realistic Tasks in Command Line Interfaces

Mike A. Merrill, Alexander G. Shaw, Nicholas Carlini, Boxuan Li, Harsh Raj, Ivan Bercovich, Lin Shi, Jeong Yeon Shin, Thomas Walshe, E. Kelly Buchanan, Junhong Shen, Guanghao Ye, Haowei Lin, Jason Poulos, Maoyu Wang, Marianna Nezhurina, Jenia Jitsev, Di Lu, Orfeas Menis Mastromichalakis, Zhiwei Xu, Zizhao Chen, Yue Liu

2026-01-23


Summary

This paper introduces Terminal-Bench 2.0, a new and challenging set of tests designed to evaluate how well AI agents can complete realistic tasks that require planning and problem-solving over many steps.

What's the problem?

Current ways of testing AI aren't very good at measuring how well models handle complex, real-world work. Existing tests are either too easy for today's frontier models or don't reflect the kinds of tasks people actually need AI to do. This makes it hard to tell whether the latest models are genuinely improving at difficult, multi-step problems.

What's the solution?

The researchers created Terminal-Bench 2.0, a set of 89 tasks that mimic problems you'd solve in a computer terminal, such as managing files, using command-line tools, and completing multi-step workflows. Each task runs in its own environment and comes with a correct solution written by a person plus automated tests that check whether the AI's answer is right. The researchers then ran current AI models and agents on these tasks and analyzed where they struggled, finding that even the best models complete fewer than 65% of the tasks.
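To make this setup more concrete, below is a minimal, hypothetical sketch of what the automated verification for one such task could look like. The task description, file paths, and test names are invented for illustration and are not taken from the actual Terminal-Bench dataset, which ships its own environments and test suites.

```python
# Hypothetical verification test for a Terminal-Bench-style task (illustration only).
# Invented task: the agent was asked to copy every line containing "ERROR" from
# /app/server.log into /app/errors.txt, preserving the original order.
# The checks inspect only the final state of the environment, not the commands
# the agent typed to get there.

from pathlib import Path

LOG_FILE = Path("/app/server.log")     # input provided by the task environment (assumed path)
OUTPUT_FILE = Path("/app/errors.txt")  # file the agent was asked to create (assumed path)


def test_output_file_exists():
    assert OUTPUT_FILE.exists(), "agent never created /app/errors.txt"


def test_error_lines_match():
    expected = [line for line in LOG_FILE.read_text().splitlines() if "ERROR" in line]
    actual = OUTPUT_FILE.read_text().splitlines()
    assert actual == expected, "extracted lines differ from the ERROR lines in the log"
```

Checking the environment's final state rather than the exact commands the agent ran means many different solution strategies can pass the same test, which makes state-based checks a natural fit for terminal tasks like these.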

Why it matters?

This work is important because it provides a more realistic and difficult way to measure the progress of AI agents. By identifying where AI fails on these tasks, researchers can focus on improving the areas where AI needs the most help, ultimately leading to more capable and useful AI systems.

Abstract

AI agents may soon become capable of autonomously completing valuable, long-horizon tasks in diverse domains. Current benchmarks either do not measure real-world tasks, or are not sufficiently difficult to meaningfully measure frontier models. To this end, we present Terminal-Bench 2.0: a carefully curated hard benchmark composed of 89 tasks in computer terminal environments inspired by problems from real workflows. Each task features a unique environment, human-written solution, and comprehensive tests for verification. We show that frontier models and agents score less than 65% on the benchmark and conduct an error analysis to identify areas for model and agent improvement. We publish the dataset and evaluation harness to assist developers and researchers in future work at https://www.tbench.ai/.