OS Agents: A Survey on MLLM-based Agents for General Computing Devices Use
Xueyu Hu, Tao Xiong, Biao Yi, Zishu Wei, Ruixuan Xiao, Yurun Chen, Jiasheng Ye, Meiling Tao, Xiangxin Zhou, Ziyu Zhao, Yuhuai Li, Shengze Xu, Shenzhi Wang, Xinchen Xu, Shuofei Qiao, Zhaokai Wang, Kun Kuang, Tieyong Zeng, Liang Wang, Jiwei Li, Yuchen Eleanor Jiang, Wangchunshu Zhou
2025-08-11
Summary
This paper surveys OS Agents: AI assistants built on (multimodal) large language models that carry out tasks on computers and phones by operating within the environments and interfaces an operating system provides, such as graphical user interfaces. It reviews how these agents are built, what they need to understand and do, and how they are evaluated across different tasks.
What's the problem?
Many AI assistants exist, but building one as capable and versatile as the fictional J.A.R.V.I.S from Iron Man remains hard. Such an agent must work reliably within complex computing environments, understand what it is asked to do, plan its actions, and handle personal data safely, and each of these requirements raises open technical and ethical issues.
What's the solution?
The paper surveys the current state of OS Agents. It explains their key components (the environment, the observation space, and the action space), the capabilities they need (understanding, planning, and grounding), and how they are constructed using domain-specific foundation models and agent frameworks. It also reviews the evaluation protocols and benchmarks researchers use to measure performance, and discusses open problems such as safety, privacy, personalization, and self-evolution, pointing to directions for future research.
Why does it matter?
This matters because OS Agents can make computers and other devices easier and more efficient to use by automating tasks that users describe in natural language and that the agent carries out through interfaces such as GUIs. Understanding their capabilities and limitations helps guide future research and development, bringing us closer to intelligent assistants that help with everyday computing safely and in a way tailored to each user.
Abstract
The dream to create AI assistants as capable and versatile as the fictional J.A.R.V.I.S from Iron Man has long captivated imaginations. With the evolution of (multi-modal) large language models ((M)LLMs), this dream is closer to reality: (M)LLM-based agents that use computing devices (e.g., computers and mobile phones) to automate tasks, by operating within the environments and interfaces (e.g., Graphical User Interface (GUI)) provided by operating systems (OS), have advanced significantly. This paper presents a comprehensive survey of these advanced agents, designated as OS Agents. We begin by elucidating the fundamentals of OS Agents, exploring their key components including the environment, observation space, and action space, and outlining essential capabilities such as understanding, planning, and grounding. We then examine methodologies for constructing OS Agents, focusing on domain-specific foundation models and agent frameworks. A detailed review of evaluation protocols and benchmarks highlights how OS Agents are assessed across diverse tasks. Finally, we discuss current challenges and identify promising directions for future research, including safety and privacy, personalization and self-evolution. This survey aims to consolidate the state of OS Agents research, providing insights to guide both academic inquiry and industrial development. An open-source GitHub repository is maintained as a dynamic resource to foster further innovation in this field. We present a 9-page version of our work, accepted by ACL 2025, to provide a concise overview of the domain.
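The three components named in the abstract (environment, observation space, action space) can be illustrated with a minimal observe-decide-act loop. The sketch below is not taken from the survey: the toy GUI, the class and function names, and the hand-written `rule_based_policy` (a stand-in for an (M)LLM) are all hypothetical and only meant to show how the pieces fit together.

```python
from dataclasses import dataclass


@dataclass
class Element:
    """One UI element; a list of these stands in for a screenshot or UI tree."""
    name: str
    kind: str  # "textbox" or "button"
    text: str = ""


@dataclass
class Action:
    """Toy action space: "type" into a target, "click" a target, or "done"."""
    kind: str
    target: str = ""
    text: str = ""


class ToyEnvironment:
    """Hypothetical GUI with one search box and one search button."""

    def __init__(self):
        self.elements = {
            "search_box": Element("search_box", "textbox"),
            "search_button": Element("search_button", "button"),
        }
        self.submitted = None  # set when the button is clicked

    def observe(self):
        # Observation space: a snapshot (copy) of the visible elements.
        return [Element(e.name, e.kind, e.text) for e in self.elements.values()]

    def step(self, action):
        # Apply one action from the action space to the GUI state.
        if action.kind == "type":
            self.elements[action.target].text = action.text
        elif action.kind == "click" and action.target == "search_button":
            self.submitted = self.elements["search_box"].text


def rule_based_policy(task, observation, history):
    """Assumed stand-in for an (M)LLM: maps task + observation to an action."""
    box = next(e for e in observation if e.name == "search_box")
    if history and history[-1].kind == "click":
        return Action("done")
    if box.text != task:
        return Action("type", target="search_box", text=task)
    return Action("click", target="search_button")


def run_agent(task, env, policy, max_steps=5):
    """Observe, choose an action, act; stop on "done" or after max_steps."""
    history = []
    for _ in range(max_steps):
        action = policy(task, env.observe(), history)
        history.append(action)
        if action.kind == "done":
            break
        env.step(action)
    return history


if __name__ == "__main__":
    env = ToyEnvironment()
    for act in run_agent("weather", env, rule_based_policy):
        print(act)
```

In a real OS Agent the policy would be a model call, the observation would typically be a screenshot or a textual description of the interface, and grounding would map the model's intent onto concrete elements or coordinates. Here those steps are collapsed into a few `if` statements so that the loop itself stays visible.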