AgentSearchBench: A Benchmark for AI Agent Search in the Wild

Bin Wu, Arastun Mammadli, Xiaoyu Zhang, Emine Yilmaz

2026-04-27

Summary

This paper focuses on the growing challenge of finding the right AI agent to help with a specific task, given the huge number of agents now available.

What's the problem?

As more and more AI agents are created, it's becoming difficult to figure out which one is best for a given job. Unlike with regular software, an agent's abilities aren't always clear just from reading what it's *supposed* to do; you often need to see it *in action*. Current methods for evaluating agents either assume you already know exactly what you're looking for, or only let you test them with very specific instructions, which isn't realistic when you're trying to solve a real-world problem described only in general terms.

What's the solution?

The researchers created a new benchmark called AgentSearchBench. It includes almost 10,000 real AI agents from multiple providers. They designed tests where agents are searched for using both specific, executable instructions *and* broader descriptions of what needs to be done. Crucially, they judged how well an agent matched a task by actually letting it *try* to complete that task, rather than relying on how similar its description was to the task text. They also experimented with short 'probe' tasks to quickly get a sense of an agent's abilities before committing to it.
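
To make the contrast concrete, here is a minimal sketch of the two ranking strategies. The `run_agent` and `grade` callables are hypothetical stand-ins for a real execution harness and task grader; none of these names come from the AgentSearchBench code.

```python
from difflib import SequenceMatcher

def similarity_rank(agents, query):
    """Rank agents purely by how similar their descriptions are to the query."""
    def sim(agent):
        return SequenceMatcher(None, query, agent["description"]).ratio()
    return sorted(agents, key=sim, reverse=True)

def execution_rank(agents, task, run_agent, grade):
    """Rank agents by how well they actually perform when run on the task.

    `run_agent(agent, task)` executes the agent and `grade(outcome)` returns
    a numeric score; both are assumed helpers, not part of the benchmark.
    """
    scores = {id(agent): grade(run_agent(agent, task)) for agent in agents}
    return sorted(agents, key=lambda a: scores[id(a)], reverse=True)
```

The paper's core finding, in these terms, is that the orderings produced by `similarity_rank` and `execution_rank` often disagree: the agent whose description best matches the query is frequently not the one that performs best when executed.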

Why it matters?

This work shows that simply matching keywords or descriptions isn't enough to find the best AI agent. It highlights the need to consider how an agent actually *behaves* when choosing one for a task. By demonstrating the value of 'behavioral signals,' this research points the way towards better methods for discovering and utilizing the rapidly expanding world of AI agents, making them more useful for everyone.

Abstract

The rapid growth of AI agent ecosystems is transforming how complex tasks are delegated and executed, creating a new challenge of identifying suitable agents for a given task. Unlike traditional tools, agent capabilities are often compositional and execution-dependent, making them difficult to assess from textual descriptions alone. However, existing research and benchmarks typically assume well-specified functionalities, controlled candidate pools, or only executable task queries, leaving realistic agent search scenarios insufficiently studied. We introduce AgentSearchBench, a large-scale benchmark for agent search in the wild, built from nearly 10,000 real-world agents across multiple providers. The benchmark formalizes agent search as retrieval and reranking problems under both executable task queries and high-level task descriptions, and evaluates relevance using execution-grounded performance signals. Experiments reveal a consistent gap between semantic similarity and actual agent performance, exposing the limitations of description-based retrieval and reranking methods. We further show that lightweight behavioral signals, including execution-aware probing, can substantially improve ranking quality, highlighting the importance of incorporating execution signals into agent discovery. Our code is available at https://github.com/Bingo-W/AgentSearchBench.
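
As a rough illustration of the "execution-aware probing" idea from the abstract, the sketch below reranks retrieved candidates by blending a semantic retrieval score with the pass rate on a few cheap probe tasks. The function names, the pass-rate signal, and the linear weighting are illustrative assumptions, not the paper's actual method.

```python
def probe_rerank(candidates, probes, run_agent, passed, alpha=0.5):
    """Rerank (agent, semantic_score) pairs using a probe pass-rate signal.

    `run_agent` executes an agent on a short probe task and `passed` judges
    the outcome; `alpha` trades the description-based score off against
    observed behavior. All of these are hypothetical stand-ins.
    """
    reranked = []
    for agent, sem_score in candidates:
        outcomes = [run_agent(agent, probe) for probe in probes]  # cheap runs
        pass_rate = sum(passed(o) for o in outcomes) / len(probes)
        reranked.append((agent, alpha * sem_score + (1 - alpha) * pass_rate))
    return sorted(reranked, key=lambda pair: pair[1], reverse=True)
```

The appeal of this kind of scheme is cost: a handful of short probe executions per candidate is far cheaper than running every agent on the full task, yet it injects exactly the behavioral signal that description matching lacks.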