
ShowUI-π: Flow-based Generative Models as GUI Dexterous Hands

Siyuan Hu, Kevin Qinghong Lin, Mike Zheng Shou

2026-01-14


Summary

This paper introduces ShowUI-π, a new AI system designed to operate computer interfaces the way a human does with a mouse and keyboard, with a focus on actions more complex than simple clicks.

What's the problem?

Current AI agents that control computers typically work by predicting where to click on the screen. This works for simple tasks, but it struggles with actions that require continuous movement, such as adjusting a slider or dragging a file into a folder. These actions require the agent to continuously watch what is happening and adjust its movements in real time, something one-shot click predictions cannot handle.

What's the solution?

The researchers developed ShowUI-π, which is different because it can handle both clicks *and* drags within the same system. It predicts how to move the cursor smoothly by making small adjustments based on what it currently sees on the screen. To train this AI, they collected and synthesized a dataset of 20,000 drag trajectories across programs like PowerPoint and Adobe Premiere Pro, and they also created a new benchmark called ScreenDrag to measure how well GUI agents perform drag-based tasks. They show that their system outperforms existing GUI agents on ScreenDrag, including much larger proprietary ones such as Gemini-2.5-CUA, while using only about 450 million parameters.
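To make the "small adjustments based on what it sees" idea concrete, here is a minimal closed-loop sketch of a drag controller. It is not the authors' implementation: `DragPolicy`, `capture_screen`, and the stopping threshold are hypothetical stand-ins for the learned model and the OS interaction layer; the point is only that the agent re-observes the screen before every small cursor step.

```python
# Minimal closed-loop drag sketch (illustrative, not the paper's code).
import numpy as np


class DragPolicy:
    """Placeholder policy: given a screenshot and the current cursor
    position, return a small (dx, dy) step. A learned model would condition
    on the screenshot; here we simply head toward a fixed target."""

    def __init__(self, target: np.ndarray, step: float = 5.0):
        self.target = target
        self.step = step

    def predict_delta(self, screenshot: np.ndarray, cursor: np.ndarray) -> np.ndarray:
        direction = self.target - cursor
        dist = float(np.linalg.norm(direction))
        if dist < 1e-6:
            return np.zeros(2)
        return direction / dist * min(self.step, dist)


def capture_screen() -> np.ndarray:
    # Stand-in for a real screenshot grab (e.g. via mss or pyautogui).
    return np.zeros((720, 1280, 3), dtype=np.uint8)


def run_drag(policy: DragPolicy, start: np.ndarray, max_steps: int = 200) -> np.ndarray:
    """Closed-loop drag: perceive, predict a small adjustment, act, repeat."""
    cursor = start.astype(float)
    for _ in range(max_steps):
        frame = capture_screen()                      # perceive
        delta = policy.predict_delta(frame, cursor)   # decide a small step
        if np.linalg.norm(delta) < 0.5:               # close enough: release
            break
        cursor += delta                               # act (real mouse move here)
    return cursor


if __name__ == "__main__":
    end = run_drag(DragPolicy(target=np.array([400.0, 300.0])),
                   start=np.array([100.0, 100.0]))
    print("drag ended at", end)
```

The key design point this illustrates is that, unlike a single click prediction, each cursor step can react to whatever the screen looks like at that moment.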

Why it matters?

This work is important because it’s a step towards creating AI agents that can truly automate tasks on computers in a way that feels natural and human-like. Being able to handle complex, continuous actions like dragging opens the door to automating a much wider range of tasks, making computers more helpful and easier to use.

Abstract

Building intelligent agents capable of dexterous manipulation is essential for achieving human-like automation in both robotics and digital environments. However, existing GUI agents rely on discrete click predictions (x,y), which prohibits free-form, closed-loop trajectories (e.g. dragging a progress bar) that require continuous, on-the-fly perception and adjustment. In this work, we develop ShowUI-π, the first flow-based generative model to act as a GUI dexterous hand, featuring the following designs: (i) Unified Discrete-Continuous Actions, integrating discrete clicks and continuous drags within a shared model, enabling flexible adaptation across diverse interaction modes; (ii) Flow-based Action Generation for drag modeling, which predicts incremental cursor adjustments from continuous visual observations via a lightweight action expert, ensuring smooth and stable trajectories; (iii) Drag Training Data and Benchmark, where we manually collect and synthesize 20K drag trajectories across five domains (e.g. PowerPoint, Adobe Premiere Pro), and introduce ScreenDrag, a benchmark with comprehensive online and offline evaluation protocols for assessing GUI agents' drag capabilities. Our experiments show that proprietary GUI agents still struggle on ScreenDrag (e.g. Operator scores 13.27, and the best Gemini-2.5-CUA reaches 22.18). In contrast, ShowUI-π achieves 26.98 with only 450M parameters, underscoring both the difficulty of the task and the effectiveness of our approach. We hope this work advances GUI agents toward human-like dexterous control in the digital world. The code is available at https://github.com/showlab/showui-pi.
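For readers unfamiliar with flow-based action generation, the sketch below shows the general flow-matching recipe that an action expert of this kind typically follows: an action chunk is sampled by integrating a learned velocity field from Gaussian noise toward the demonstrated actions, conditioned on an observation embedding. The network size, action horizon, observation dimension, and step count here are illustrative assumptions, not the released ShowUI-π model.

```python
# Generic flow-matching sketch for action generation (illustrative only).
import torch
import torch.nn as nn

ACTION_DIM, HORIZON, OBS_DIM = 2, 8, 64   # (dx, dy) per step, 8-step chunk (assumed)


class ActionExpert(nn.Module):
    """Tiny velocity-field network v_theta(a_t, t, obs)."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(HORIZON * ACTION_DIM + 1 + OBS_DIM, 256),
            nn.GELU(),
            nn.Linear(256, HORIZON * ACTION_DIM),
        )

    def forward(self, actions: torch.Tensor, t: torch.Tensor, obs: torch.Tensor) -> torch.Tensor:
        x = torch.cat([actions.flatten(1), t[:, None], obs], dim=-1)
        return self.net(x).view(-1, HORIZON, ACTION_DIM)


@torch.no_grad()
def sample_actions(model: ActionExpert, obs: torch.Tensor, steps: int = 10) -> torch.Tensor:
    """Euler integration of the flow from noise (t=0) to an action chunk (t=1)."""
    a = torch.randn(obs.shape[0], HORIZON, ACTION_DIM)
    dt = 1.0 / steps
    for i in range(steps):
        t = torch.full((obs.shape[0],), i * dt)
        a = a + dt * model(a, t, obs)   # follow the predicted velocity field
    return a


def flow_matching_loss(model: ActionExpert, obs: torch.Tensor,
                       demo_actions: torch.Tensor) -> torch.Tensor:
    """Standard flow-matching target: regress v* = a_1 - a_0 along the
    straight path a_t = (1 - t) * a_0 + t * a_1, with a_0 ~ N(0, I)."""
    a0 = torch.randn_like(demo_actions)
    t = torch.rand(demo_actions.shape[0])
    a_t = (1 - t)[:, None, None] * a0 + t[:, None, None] * demo_actions
    v_pred = model(a_t, t, obs)
    return ((v_pred - (demo_actions - a0)) ** 2).mean()
```

At inference time, conditioning the velocity field on fresh screen observations as the chunk is executed is what gives the drag its closed-loop, continuously adjustable character.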