Survey on Evaluation of LLM-based Agents
Asaf Yehudai, Lilach Eden, Alan Li, Guy Uziel, Yilun Zhao, Roy Bar-Haim, Arman Cohan, Michal Shmueli-Scheuer
2025-03-21
Summary
This paper surveys how researchers evaluate AI agents built on large language models, covering how well these agents plan, use tools, reflect on their own behavior, and remember information across interactions.
What's the problem?
As LLM-based agents become more capable, the field lacks a clear picture of how to test them: benchmarks are scattered across many tasks and domains, making it hard to tell which agents actually work well, what the benchmarks measure, and where they fall short.
What's the solution?
The researchers systematically review existing benchmarks and organize them into four categories: tests of fundamental capabilities such as planning, tool use, self-reflection, and memory; application-specific benchmarks for web, software engineering, scientific, and conversational agents; benchmarks for generalist agents; and frameworks for running agent evaluations.
Why it matters?
Mapping the landscape of agent evaluation helps researchers choose appropriate benchmarks, compare systems fairly, and spot the gaps that still need work, such as measuring cost-efficiency, safety, and robustness.
Abstract
The emergence of LLM-based agents represents a paradigm shift in AI, enabling autonomous systems to plan, reason, use tools, and maintain memory while interacting with dynamic environments. This paper provides the first comprehensive survey of evaluation methodologies for these increasingly capable agents. We systematically analyze evaluation benchmarks and frameworks across four critical dimensions: (1) fundamental agent capabilities, including planning, tool use, self-reflection, and memory; (2) application-specific benchmarks for web, software engineering, scientific, and conversational agents; (3) benchmarks for generalist agents; and (4) frameworks for evaluating agents. Our analysis reveals emerging trends, including a shift toward more realistic, challenging evaluations with continuously updated benchmarks. We also identify critical gaps that future research must address, particularly in assessing cost-efficiency, safety, and robustness, and in developing fine-grained and scalable evaluation methods. This survey maps the rapidly evolving landscape of agent evaluation, identifies current limitations, and proposes directions for future research.
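To make the first dimension concrete, below is a minimal, hypothetical sketch (not from the paper) of how a tool-use benchmark might score an agent: it compares the agent's recorded tool calls against a gold trace with a set-based F1. The ToolCall type and tool_use_f1 function are illustrative names, not an API from any framework the survey covers.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class ToolCall:
    """One tool invocation recorded in an agent's trace (illustrative)."""
    name: str
    args: tuple  # hashable, so traces can be compared as sets

def tool_use_f1(predicted: list[ToolCall], gold: list[ToolCall]) -> float:
    """Set-based F1 between predicted and gold tool calls.

    Order-insensitive by design; stricter benchmarks also check call
    ordering and validate argument values step by step.
    """
    pred_set, gold_set = set(predicted), set(gold)
    if not pred_set or not gold_set:
        return 0.0
    tp = len(pred_set & gold_set)
    precision, recall = tp / len(pred_set), tp / len(gold_set)
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: the agent issues the right search but fetches the wrong page.
gold = [ToolCall("search", ("agent evaluation survey",)),
        ToolCall("fetch", ("https://example.org/gold-page",))]
pred = [ToolCall("search", ("agent evaluation survey",)),
        ToolCall("fetch", ("https://example.org/other-page",))]
print(f"tool-use F1: {tool_use_f1(pred, gold):.2f}")  # prints 0.50
```

Real benchmarks layer more on top of a score like this, such as environment-state checks or LLM-as-a-judge rubrics, but the core pattern of comparing an agent's trace to a reference is the same.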