AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents
Harsh Trivedi, Tushar Khot, Mareike Hartmann, Ruskin Manku, Vinty Dong, Edward Li, Shashank Gupta, Ashish Sabharwal, Niranjan Balasubramanian
2024-07-29

Summary
This paper presents AppWorld, a new platform designed to test and improve autonomous agents that perform everyday digital tasks across multiple applications. It provides a controllable, realistic environment in which agents interact with apps through APIs and must generate complex code to complete their tasks.
What's the problem?
Most existing benchmarks for coding and tool-use agents only require tasks that reduce to a simple sequence of API calls. This makes it hard to evaluate how well agents handle more complicated, real-world scenarios that span multiple applications and demand rich, interactive code. As a result, there is a need for a more comprehensive testing environment that reflects how people actually use their apps.
What's the solution?
To address this gap, the authors built the AppWorld Engine, an execution environment of nine day-to-day apps that agents operate through 457 APIs, populated with realistic digital activity simulating the lives of roughly 100 fictitious users. On top of it, they created the AppWorld Benchmark, a suite of 750 diverse and challenging tasks that require agents to write rich, interactive code while working across these apps. The platform evaluates agents programmatically with state-based unit tests, which accept different valid ways of completing a task while also flagging unintended side effects of an agent's actions. Even the strongest model tested, GPT-4o, solved only about 49% of the 'normal' tasks and about 30% of the 'challenge' tasks, underscoring the benchmark's difficulty.
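The interaction pattern the paper describes is an agent that alternates between writing code against app APIs and reading the execution output, rather than emitting one flat list of API calls. Below is a minimal, illustrative sketch of such a loop. The environment interface (`AppEnvironment`, `execute`, `task_instruction`, `is_complete`) and the `generate_code` callback are assumptions made for illustration only, not AppWorld's actual Python interface (see https://appworld.dev/ for that).

```python
# Illustrative sketch of an interactive coding-agent loop, NOT AppWorld's real API.
# The agent repeatedly generates Python, the environment executes it against the
# apps' APIs, and the output is fed back so the agent can branch, loop, and retry.
from typing import Protocol


class AppEnvironment(Protocol):
    """Hypothetical stand-in for an AppWorld-like execution environment."""

    def execute(self, code: str) -> str: ...       # run agent code, return its output
    def task_instruction(self) -> str: ...         # natural-language task description
    def is_complete(self) -> bool: ...             # has the agent signalled completion?


def run_agent(env: AppEnvironment, generate_code, max_turns: int = 25) -> None:
    """Alternate between generating code and executing it until the task ends."""
    history = [f"Task: {env.task_instruction()}"]
    for _ in range(max_turns):
        code = generate_code(history)              # e.g., a call to an LLM
        output = env.execute(code)                 # code may call app APIs, inspect results, etc.
        history.extend([f"Code:\n{code}", f"Output:\n{output}"])
        if env.is_complete():
            break
```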
Why it matters?
This research is important because it provides a new way to evaluate and improve AI agents that assist with daily digital tasks. By creating a realistic testing environment, AppWorld can help developers build better agents that can handle complex interactions across multiple applications, ultimately making technology more helpful and efficient for users.
Abstract
Autonomous agents that address day-to-day digital tasks (e.g., ordering groceries for a household) must not only operate multiple apps (e.g., notes, messaging, shopping app) via APIs, but also generate rich code with complex control flow in an iterative manner based on their interaction with the environment. However, existing benchmarks for tool use are inadequate, as they only cover tasks that require a simple sequence of API calls. To remedy this gap, we built AppWorld Engine, a high-quality execution environment (60K lines of code) of 9 day-to-day apps operable via 457 APIs and populated with realistic digital activities simulating the lives of ~100 fictitious users. We then created AppWorld Benchmark (40K lines of code), a suite of 750 natural, diverse, and challenging autonomous agent tasks requiring rich and interactive code generation. It supports robust programmatic evaluation with state-based unit tests, allowing for different ways of completing a task while also checking for unexpected changes, i.e., collateral damage. The state-of-the-art LLM, GPT-4o, solves only ~49% of our 'normal' tasks and ~30% of 'challenge' tasks, while other models solve at least 16% fewer. This highlights the benchmark's difficulty and AppWorld's potential to push the frontiers of interactive coding agents. The project website is available at https://appworld.dev/.
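To make the abstract's "state-based unit tests" concrete: rather than matching a fixed sequence of API calls, the evaluator can compare app state before and after the agent runs, checking both that the intended changes occurred and that nothing else was disturbed (collateral damage). The sketch below assumes a simple per-table snapshot format; the table names (`shopping_orders`, `shopping_cart`) and helper functions are hypothetical and not the paper's actual test harness.

```python
# Minimal sketch of state-based evaluation, assuming each app's database can be
# snapshotted as {table_name: {record_id: record}} before and after the agent runs.
def diff_tables(before: dict, after: dict) -> dict:
    """Return, per table, the ids of records that were added, removed, or modified."""
    changes = {}
    for table in before.keys() | after.keys():
        old, new = before.get(table, {}), after.get(table, {})
        touched = {rid for rid in old.keys() | new.keys() if old.get(rid) != new.get(rid)}
        if touched:
            changes[table] = touched
    return changes


def test_groceries_ordered(state_before: dict, state_after: dict) -> None:
    """Pass if a grocery order now exists and nothing unrelated was changed."""
    changes = diff_tables(state_before, state_after)
    # Intended effect: at least one new or modified record in the (hypothetical) orders table.
    assert changes.get("shopping_orders"), "expected a new grocery order"
    # No collateral damage: only explicitly allowed tables may differ.
    allowed = {"shopping_orders", "shopping_cart"}
    assert set(changes) <= allowed, f"unexpected changes in {set(changes) - allowed}"
```

Because the check is on the final state rather than on the exact API calls, any sequence of actions that produces the right order without touching unrelated data passes, which is how the benchmark can accept multiple valid solutions to the same task.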