ShowUI-Aloha: Human-Taught GUI Agent
Yichun Zhang, Xiangwu Guo, Yauhong Goh, Jessica Hu, Zhiheng Chen, Xin Wang, Difei Gao, Mike Zheng Shou
2026-01-13
Summary
This paper introduces a new system, ShowUI-Aloha, designed to help computers learn to use graphical user interfaces (GUIs) – things like windows, buttons, and menus – more like humans do.
What's the problem?
Teaching computers to automate tasks within GUIs is really hard because it requires a lot of training data. Getting this data is difficult; recordings of people using computers are often long, messy, and don't come with explanations of *why* someone clicked on something or typed a certain thing. It's like trying to learn a recipe just by watching someone cook without them telling you what they're doing.
What's the solution?
ShowUI-Aloha tackles this by creating a system that takes raw screen recordings of people using their computers and turns them into structured, understandable tasks. It does this in four steps: first, it records everything happening on the screen. Second, it figures out what the user was trying to do based on their actions and what was visible on the screen, and then describes it in plain language. Third, it plans out how to accomplish the task step-by-step. Finally, it actually performs those steps on the computer, making sure everything is done safely and providing feedback along the way.
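The four steps above can be sketched as a simple record → learn → plan → execute pipeline. This is a minimal toy illustration under assumed names, not the paper's actual implementation: the `Event` class, the stage functions, and the sample events are all hypothetical, and real screen capture and OS control are stubbed out with plain data.

```python
from dataclasses import dataclass


@dataclass
class Event:
    """A raw user interaction captured during a screen recording."""
    kind: str          # e.g. "click", "type", "scroll"
    target: str        # what was under the cursor / in focus
    payload: str = ""  # typed text, scroll amount, etc.


def record() -> list[Event]:
    """Step 1: capture raw interactions (stubbed with sample data here)."""
    return [
        Event("click", "File menu"),
        Event("click", "Save As dialog"),
        Event("type", "filename field", "report.pdf"),
        Event("click", "Save button"),
    ]


def learn(events: list[Event]) -> list[str]:
    """Step 2: translate raw events into natural-language captions."""
    captions = []
    for e in events:
        if e.kind == "type":
            captions.append(f'Type "{e.payload}" into the {e.target}')
        else:
            captions.append(f"{e.kind.capitalize()} the {e.target}")
    return captions


def plan(captions: list[str]) -> list[str]:
    """Step 3: order the captions into a step-by-step action plan."""
    return [f"Step {i + 1}: {c}" for i, c in enumerate(captions)]


def execute(steps: list[str]) -> list[str]:
    """Step 4: carry out each step (logged here instead of real OS actions)."""
    return [f"done: {s}" for s in steps]


plan_steps = plan(learn(record()))
print(plan_steps[0])  # → Step 1: Click the File menu
print(execute(plan_steps)[-1])
```

In a real system the `record` stage would hook into OS input events and the `execute` stage would drive the mouse and keyboard; here both are reduced to lists of strings so the data flow between the four stages stays visible.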
Why does it matter?
This work is important because it provides a way to automatically create the training data needed to build computers that can intelligently interact with GUIs. This could lead to programs that can help people with their computer tasks, automate repetitive jobs, or even assist people who have difficulty using computers themselves, all by simply watching how humans do things.
Abstract
Graphical User Interfaces (GUIs) are central to human-computer interaction, yet automating complex GUI tasks remains a major challenge for autonomous agents, largely due to a lack of scalable, high-quality training data. While recordings of human demonstrations offer a rich data source, they are typically long, unstructured, and lack annotations, making them difficult for agents to learn from. To address this, we introduce ShowUI-Aloha, a comprehensive pipeline that transforms unstructured, in-the-wild human screen recordings from desktop environments into structured, actionable tasks. Our framework includes four key components: (1) a recorder that captures screen video along with precise user interactions such as mouse clicks, keystrokes, and scrolls; (2) a learner that semantically interprets these raw interactions and the surrounding visual context, translating them into descriptive natural-language captions; (3) a planner that reads the parsed demonstrations, maintains task state, and dynamically formulates the next high-level action plan based on contextual reasoning; and (4) an executor that faithfully carries out these action plans at the OS level, performing precise clicks, drags, text inputs, and window operations with safety checks and real-time feedback. Together, these components provide a scalable solution for collecting and parsing real-world human data, demonstrating a viable path toward building general-purpose GUI agents that learn effectively from simply observing humans.
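As one concrete illustration of the executor's safety checks mentioned in the abstract, the sketch below validates a click action against the screen bounds before dispatching it and returns a feedback message either way. This is a simplified assumption of what such a check might look like, not the paper's code: the `Click` class, the fixed screen size, and the function names are all hypothetical, and a real executor would call OS-level input APIs instead of returning strings.

```python
from dataclasses import dataclass


@dataclass
class Click:
    """A low-level click action the executor is asked to perform."""
    x: int
    y: int


# Assumed screen resolution for the safety check (hypothetical constant).
SCREEN_W, SCREEN_H = 1920, 1080


def is_safe(action: Click) -> bool:
    """Safety check: reject clicks that fall outside the visible screen."""
    return 0 <= action.x < SCREEN_W and 0 <= action.y < SCREEN_H


def execute_click(action: Click) -> str:
    """Dispatch the click only if it passes the safety check, and return
    real-time feedback (a real executor would drive the OS here)."""
    if not is_safe(action):
        return f"blocked: click at ({action.x}, {action.y}) is off-screen"
    return f"clicked at ({action.x}, {action.y})"


print(execute_click(Click(640, 480)))   # → clicked at (640, 480)
print(execute_click(Click(2500, 100)))  # → blocked: click at (2500, 100) is off-screen
```

The same gate-then-report pattern extends naturally to drags, text input, and window operations: each action is validated against the current screen state before it is performed, and the feedback string becomes the agent's observation for the next planning step.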