TheAgentCompany: Benchmarking LLM Agents on Consequential Real World Tasks
Frank F. Xu, Yufan Song, Boxuan Li, Yuxuan Tang, Kritanjali Jain, Mengxue Bao, Zora Z. Wang, Xuhui Zhou, Zhitong Guo, Murong Cao, Mingyang Yang, Hao Yang Lu, Amaad Martin, Zhe Su, Leander Maben, Raj Mehta, Wayne Chi, Lawrence Jang, Yiqing Xie, Shuyan Zhou, Graham Neubig
2024-12-19
Summary
This paper introduces TheAgentCompany, a new benchmark designed to evaluate how well AI agents perform real-world tasks similar to what human workers do, such as browsing the web and writing code.
What's the problem?
As AI technology advances, there is growing interest in using AI agents for work-related tasks. However, it's unclear how effective these agents are at completing actual tasks without human intervention. Answering this question matters both for businesses considering adopting AI and for understanding AI's impact on jobs in the economy.
What's the solution?
TheAgentCompany provides a structured way to test AI agents by creating a self-contained environment that simulates a small software company. This environment includes a variety of tasks that workers in such a company might perform, allowing researchers to measure how well different AI agents complete them. The authors found that the best-performing agent completed 24% of the tasks on its own, indicating that while AI can handle some simpler jobs, it still struggles with more complex, long-horizon tasks.
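To make the evaluation concrete, here is a minimal sketch of how automated per-task scoring in a benchmark like this could work. All names, the `Checkpoint` structure, and the scoring scheme below are illustrative assumptions for this summary, not the paper's actual API:

```python
# Hypothetical sketch: grading one benchmark task by checking whether
# the agent reached a set of verifiable checkpoints.
from dataclasses import dataclass

@dataclass
class Checkpoint:
    description: str  # what the agent was supposed to accomplish
    points: int       # weight of this checkpoint
    passed: bool      # result of an automated check of the environment

def score_task(checkpoints: list[Checkpoint]) -> tuple[bool, float]:
    """Return (fully_completed, partial_score) for a single task."""
    total = sum(c.points for c in checkpoints)
    earned = sum(c.points for c in checkpoints if c.passed)
    return earned == total, earned / total if total else 0.0

# Example: an agent finishes two of three checkpoints on a coding task.
checkpoints = [
    Checkpoint("Cloned the project repository", 1, True),
    Checkpoint("Fixed the failing unit test", 2, True),
    Checkpoint("Messaged the reviewer on internal chat", 1, False),
]
done, score = score_task(checkpoints)
print(done, score)  # False 0.75
```

A scheme like this distinguishes full autonomous completion (the headline 24% figure) from partial progress on harder, multi-step tasks.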
Why it matters?
This research is significant because it helps businesses understand the capabilities and limitations of AI agents in real-world settings. By evaluating how well these agents can perform tasks similar to those done by human workers, TheAgentCompany provides valuable insights for companies looking to integrate AI into their operations and helps policymakers consider the effects of AI on the job market.
Abstract
We interact with computers on an everyday basis, be it in everyday life or work, and many aspects of work can be done entirely with access to a computer and the Internet. At the same time, thanks to improvements in large language models (LLMs), there has also been a rapid development in AI agents that interact with and effect change in their surrounding environments. But how performant are AI agents at helping to accelerate or even autonomously perform work-related tasks? The answer to this question has important implications for both industry looking to adopt AI into their workflows, and for economic policy to understand the effects that adoption of AI may have on the labor market. To measure the progress of these LLM agents' performance in performing real-world professional tasks, in this paper, we introduce TheAgentCompany, an extensible benchmark for evaluating AI agents that interact with the world in similar ways to those of a digital worker: by browsing the Web, writing code, running programs, and communicating with other coworkers. We build a self-contained environment with internal web sites and data that mimics a small software company environment, and create a variety of tasks that may be performed by workers in such a company. We test baseline agents powered by both closed API-based and open-weights language models (LMs), and find that with the most competitive agent, 24% of the tasks can be completed autonomously. This paints a nuanced picture of task automation with LM agents -- in a setting simulating a real workplace, a good portion of simpler tasks could be solved autonomously, but more difficult long-horizon tasks are still beyond the reach of current systems.