
Towards a Science of AI Agent Reliability

Stephan Rabanser, Sayash Kapoor, Peter Kirgis, Kangheng Liu, Saiteja Utpala, Arvind Narayanan

2026-02-19

Summary

This paper points out that while AI agents are getting better at tasks according to standard tests, they still often mess up in real-world situations, and current testing methods don't fully explain why.

What's the problem?

The main issue is that we usually judge AI agents on whether they succeed or fail, while ignoring *how* they succeed or fail. A single success rate doesn't tell us whether an agent is reliable: does it give the same answer every time? Can it handle small changes in the situation? Does it fail in a way we can understand? Are its mistakes usually small, or potentially dangerous? This lack of detail is a problem, especially when we trust AI with important jobs.

What's the solution?

The researchers created a more thorough way to evaluate AI agents, built around twelve concrete measurements. These measurements break reliability down into four key areas: consistency (does it give the same answer repeatedly?), robustness (can it handle slight changes?), predictability (do its failures make sense?), and safety (how bad are its mistakes?). They then tested fourteen AI models on two complementary sets of tasks using these measurements.
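The paper's exact formulas aren't reproduced on this page, so the snippet below is only an illustrative sketch of the kind of measurement these dimensions imply: a hypothetical run-to-run consistency score and a robustness gap under lightly perturbed tasks. The function names, inputs, and aggregation choices are assumptions for illustration, not the authors' actual twelve metrics.

```python
from typing import Sequence

def consistency_at_k(run_outcomes: Sequence[Sequence[bool]]) -> float:
    """Fraction of tasks on which all repeated runs agree (all succeed or all fail)."""
    agreeing = sum(1 for outcomes in run_outcomes if len(set(outcomes)) == 1)
    return agreeing / len(run_outcomes)

def robustness_gap(clean_success: Sequence[bool], perturbed_success: Sequence[bool]) -> float:
    """Drop in success rate when the same tasks are rerun with small perturbations
    (e.g., paraphrased instructions or reordered tool results)."""
    clean_rate = sum(clean_success) / len(clean_success)
    perturbed_rate = sum(perturbed_success) / len(perturbed_success)
    return clean_rate - perturbed_rate

# Hypothetical outcomes: 3 tasks, each attempted 4 times by the same agent.
repeated_runs = [
    [True, True, True, True],      # always succeeds
    [True, False, True, False],    # flaky: same task, different outcomes
    [False, False, False, False],  # always fails
]
print(consistency_at_k(repeated_runs))                          # ~0.67
print(robustness_gap([True, True, True], [True, False, True]))  # ~0.33
```

The point of the sketch is that two agents with the same headline accuracy can look very different on measures like these, which is the gap the paper's metrics are designed to expose.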

Why it matters?

This work is important because it shows that simply improving an AI's overall accuracy isn't enough. The tests revealed that even with recent improvements in AI capabilities, reliability hasn't improved much. These new measurements give us a better way to understand how AI agents work, where they struggle, and how they might fail, which is crucial for building trustworthy AI systems, especially for safety-critical applications.

Abstract

AI agents are increasingly deployed to execute important tasks. While rising accuracy scores on standard benchmarks suggest rapid progress, many agents still fail in practice. This discrepancy highlights a fundamental limitation of current evaluations: compressing agent behavior into a single success metric obscures critical operational flaws. Notably, it ignores whether agents behave consistently across runs, withstand perturbations, fail predictably, or have bounded error severity. Grounded in safety-critical engineering, we provide a holistic performance profile by proposing twelve concrete metrics that decompose agent reliability along four key dimensions: consistency, robustness, predictability, and safety. Evaluating 14 agentic models across two complementary benchmarks, we find that recent capability gains have only yielded small improvements in reliability. By exposing these persistent limitations, our metrics complement traditional evaluations while offering tools for reasoning about how agents perform, degrade, and fail.