Agent S: An Open Agentic Framework that Uses Computers Like a Human

Saaket Agashe, Jiuzhou Han, Shuyu Gan, Jiachen Yang, Ang Li, Xin Eric Wang

2024-10-13

Summary

This paper introduces Agent S, an open framework that lets an AI agent operate a computer the way a human does, automating complex, multi-step tasks through the graphical user interface (GUI).

What's the problem?

Automating tasks on computers is difficult because an agent needs domain-specific knowledge about each application, has to plan over long sequences of actions, and must cope with interfaces that change over time and differ from one application to the next. Current systems struggle with these challenges, so they often fail at complex, multi-step tasks.

What's the solution?

To tackle these issues, the authors developed Agent S, which uses a method called experience-augmented hierarchical planning. Agent S draws on both external knowledge (found through search) and its own past experience (retrieved from memory) to break a complicated task into smaller, manageable subtasks and then carry them out. It also uses an Agent-Computer Interface (ACI) that helps the underlying Multimodal Large Language Model (MLLM) reason about and control the GUI. On the OSWorld benchmark, Agent S raises the success rate over the baseline by 9.37% (an 83.6% relative improvement), setting a new state of the art, and it also generalizes to a different operating system on the WindowsAgentArena benchmark.
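The two-level idea described above can be illustrated with a toy sketch. This is not the authors' implementation: `ExperienceMemory`, `run_task`, the word-overlap retrieval, the write-back of outcomes into memory, and the `search`, `planner`, and `executor` callables are all hypothetical stand-ins, assumed here only to show how external knowledge and retrieved experience could feed a planner at the task level and an executor at the subtask level.

```python
from dataclasses import dataclass, field


@dataclass
class ExperienceMemory:
    """Toy experience store (hypothetical): maps a description to a past note."""
    entries: dict = field(default_factory=dict)

    def retrieve(self, query: str) -> str:
        # Assumed retrieval rule: return the note whose key shares the most
        # words with the query, or an empty string if nothing overlaps.
        query_words = set(query.lower().split())
        best_key, best_overlap = None, 0
        for key in self.entries:
            overlap = len(query_words & set(key.lower().split()))
            if overlap > best_overlap:
                best_key, best_overlap = key, overlap
        return self.entries[best_key] if best_key is not None else ""

    def store(self, key: str, note: str) -> None:
        self.entries[key] = note


def run_task(task, search, planner, executor, task_memory, subtask_memory):
    """Plan at the task level, then execute at the subtask level."""
    # Level 1: combine external knowledge with past task-level experience.
    context = search(task) + "\n" + task_memory.retrieve(task)
    subtasks = planner(task, context)

    results = []
    for subtask in subtasks:
        # Level 2: retrieve experience specific to this subtask before acting.
        hint = subtask_memory.retrieve(subtask)
        outcome = executor(subtask, hint)
        # Assumption: outcomes are written back so later runs can reuse them.
        subtask_memory.store(subtask, outcome)
        results.append(outcome)

    task_memory.store(task, "; ".join(subtasks))
    return results


if __name__ == "__main__":
    # Stub components stand in for web search, an MLLM planner, and a GUI executor.
    task_memory = ExperienceMemory()
    subtask_memory = ExperienceMemory({"open settings": "use the system menu"})
    outcomes = run_task(
        "change wallpaper",
        search=lambda task: "online documentation snippet",
        planner=lambda task, context: ["open settings", "pick wallpaper"],
        executor=lambda subtask, hint: f"{subtask} (hint: {hint or 'none'})",
        task_memory=task_memory,
        subtask_memory=subtask_memory,
    )
    for line in outcomes:
        print(line)
```

In a real agent the stubs would be replaced by model calls and GUI actions; the sketch only shows where knowledge and experience enter at each level.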

Why it matters?

This research is important because it represents a step forward in making AI systems more capable of handling real-world computer tasks autonomously. By improving how these systems learn and interact with their environments, Agent S could lead to more efficient and user-friendly applications in many areas, such as personal assistants, automation tools, and other technologies that require complex decision-making.

Abstract

We present Agent S, an open agentic framework that enables autonomous interaction with computers through a Graphical User Interface (GUI), aimed at transforming human-computer interaction by automating complex, multi-step tasks. Agent S aims to address three key challenges in automating computer tasks: acquiring domain-specific knowledge, planning over long task horizons, and handling dynamic, non-uniform interfaces. To this end, Agent S introduces experience-augmented hierarchical planning, which learns from external knowledge search and internal experience retrieval at multiple levels, facilitating efficient task planning and subtask execution. In addition, it employs an Agent-Computer Interface (ACI) to better elicit the reasoning and control capabilities of GUI agents based on Multimodal Large Language Models (MLLMs). Evaluation on the OSWorld benchmark shows that Agent S outperforms the baseline by 9.37% on success rate (an 83.6% relative improvement) and achieves a new state-of-the-art. Comprehensive analysis highlights the effectiveness of individual components and provides insights for future improvements. Furthermore, Agent S demonstrates broad generalizability to different operating systems on a newly-released WindowsAgentArena benchmark. Code available at https://github.com/simular-ai/Agent-S.
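The two figures quoted for OSWorld can be cross-checked with a little arithmetic. The sketch below assumes that 9.37% is an absolute gain in percentage points and that 83.6% is that gain divided by the baseline success rate; under this reading the two numbers together imply the approximate baseline and Agent S success rates. The variable names are illustrative, and the derived values are only as precise as the rounded inputs.

```python
# Figures taken from the abstract.
absolute_gain = 9.37   # assumed to be percentage points of success rate
relative_gain = 0.836  # 83.6% relative improvement over the baseline

# relative_gain = absolute_gain / baseline, so solve for the baseline.
baseline = absolute_gain / relative_gain
agent_s = baseline + absolute_gain

print(f"implied baseline success rate: {baseline:.1f}%")
print(f"implied Agent S success rate:  {agent_s:.1f}%")
```

Under these assumptions the baseline comes out at roughly 11% and Agent S at roughly 21%, which shows that most OSWorld tasks remain unsolved even with the reported improvement.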
