
Windows Agent Arena: Evaluating Multi-Modal OS Agents at Scale

Rogerio Bonatti, Dan Zhao, Francesco Bonacci, Dillon Dupont, Sara Abdali, Yinheng Li, Justin Wagle, Kazuhito Koishida, Arthur Bucker, Lawrence Jang, Zack Hui

2024-09-13


Summary

This paper presents the Windows Agent Arena, a new testing environment for AI agents that operate within the Windows operating system, allowing researchers to evaluate how well these agents handle the kinds of everyday computer tasks that people perform.

What's the problem?

Measuring how well AI agents perform in realistic settings is difficult because most existing benchmarks cover only narrow domains, such as text-only tasks or web navigation, and full evaluations can take days to run. This makes it hard to judge how effective these agents really are in practical, multi-step situations.

What's the solution?

The authors created the Windows Agent Arena, which includes more than 150 diverse tasks that AI agents carry out inside a real Windows operating system, using the same applications, tools, and web browsers available to human users. Because the benchmark can be parallelized in Azure, a full evaluation finishes in about 20 minutes instead of days. They also introduce a new multi-modal agent, Navi, to demonstrate the benchmark's capabilities; a minimal sketch of the agent-environment loop such a benchmark runs for each task is shown below.
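
To make this more concrete, the sketch below shows the kind of observe-plan-act loop an OS benchmark runs for each task: reset a sandboxed Windows environment, let the agent act on screen observations for a bounded number of steps, then score success with a task-specific checker. All names here (WindowsTaskEnv, ScriptedAgent, run_task, the example task id) are hypothetical illustrations for explanation only, not the actual Windows Agent Arena API.

```python
# Minimal sketch of an observe-plan-act evaluation loop, assuming a hypothetical
# environment/agent interface; this is NOT the actual Windows Agent Arena API.

from dataclasses import dataclass


@dataclass
class Observation:
    screenshot: bytes        # pixels of the current Windows screen
    accessibility_tree: str  # UI-element metadata the agent can parse


class WindowsTaskEnv:
    """Stand-in for a sandboxed Windows VM hosting one benchmark task."""

    def __init__(self, task_id: str):
        self.task_id = task_id
        self.done = False

    def reset(self) -> Observation:
        # Restore the VM snapshot and open the task's starting state.
        self.done = False
        return Observation(screenshot=b"", accessibility_tree="<desktop/>")

    def step(self, action: str) -> Observation:
        # Execute a mouse/keyboard action and return the new screen state.
        self.done = True  # placeholder termination for this sketch
        return Observation(screenshot=b"", accessibility_tree="<desktop/>")

    def evaluate(self) -> bool:
        # Task-specific checker: did the agent reach the goal state?
        return False


class ScriptedAgent:
    """Trivial placeholder; a real agent would call an LLM with the screenshot."""

    def act(self, obs: Observation) -> str:
        return "click(100, 200)"


def run_task(env: WindowsTaskEnv, agent: ScriptedAgent, max_steps: int = 20) -> bool:
    obs = env.reset()
    for _ in range(max_steps):
        action = agent.act(obs)   # plan the next click/keystroke from the screen
        obs = env.step(action)
        if env.done:
            break
    return env.evaluate()         # success or failure, scored per task


if __name__ == "__main__":
    # Running many such tasks in parallel (e.g., one VM per task in the cloud)
    # is what lets a full benchmark pass finish in minutes rather than days.
    print(run_task(WindowsTaskEnv("notepad_save_file"), ScriptedAgent()))
```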

Why it matters?

This research is important because it supports the development of AI agents that can assist people with their daily computer tasks. By providing a realistic, reproducible testing environment, it paves the way for more effective and capable AI tools that enhance productivity and software accessibility; the gap between Navi's 19.5% success rate and the 74.5% achieved by an unassisted human also shows how much room remains for progress.

Abstract

Large language models (LLMs) show remarkable potential to act as computer agents, enhancing human productivity and software accessibility in multi-modal tasks that require planning and reasoning. However, measuring agent performance in realistic environments remains a challenge since: (i) most benchmarks are limited to specific modalities or domains (e.g. text-only, web navigation, Q&A, coding) and (ii) full benchmark evaluations are slow (on order of magnitude of days) given the multi-step sequential nature of tasks. To address these challenges, we introduce the Windows Agent Arena: a reproducible, general environment focusing exclusively on the Windows operating system (OS) where agents can operate freely within a real Windows OS and use the same wide range of applications, tools, and web browsers available to human users when solving tasks. We adapt the OSWorld framework (Xie et al., 2024) to create 150+ diverse Windows tasks across representative domains that require agent abilities in planning, screen understanding, and tool usage. Our benchmark is scalable and can be seamlessly parallelized in Azure for a full benchmark evaluation in as little as 20 minutes. To demonstrate Windows Agent Arena's capabilities, we also introduce a new multi-modal agent, Navi. Our agent achieves a success rate of 19.5% in the Windows domain, compared to 74.5% performance of an unassisted human. Navi also demonstrates strong performance on another popular web-based benchmark, Mind2Web. We offer extensive quantitative and qualitative analysis of Navi's performance, and provide insights into the opportunities for future research in agent development and data generation using Windows Agent Arena.

Webpage: https://microsoft.github.io/WindowsAgentArena
Code: https://github.com/microsoft/WindowsAgentArena