Turing Test on Screen: A Benchmark for Mobile GUI Agent Humanization

Jiachen Zhu, Lingyu Yang, Rong Shan, Congmin Zheng, Zeyu Zheng, Weiwen Liu, Yong Yu, Weinan Zhang, Jianghao Lin

2026-04-15

Summary

This research focuses on making AI agents that interact with apps and websites appear more human-like to avoid being blocked by platforms trying to detect and stop automated activity.

What's the problem?

Currently, AI agents designed to automate tasks on phones or computers are often easy to identify as bots because their actions don't quite mimic how a real person uses a device. Apps and websites are getting better at spotting and blocking this automated activity, yet most research focuses on making agents *work* rather than making them *blend in*. As a result, agents can't survive long enough to complete tasks: they get detected and shut out.

What's the solution?

The researchers developed a way to measure how human-like an agent's actions are, framing it as a 'Turing Test on Screen'. They collected a high-fidelity dataset of how people actually touch and swipe on their phones, then used it to train agents to behave more naturally. They tried techniques ranging from adding random 'noise' to the agent's movements to directly matching human behavior patterns from the data, and found they could significantly improve how well an agent hides its robotic nature without making it less effective at completing tasks.
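To give a concrete feel for the "heuristic noise" idea, here is a minimal sketch of humanizing a swipe gesture. This is an illustrative example, not the paper's actual method: the function name, parameters, and the choice of smoothstep easing plus Gaussian jitter are all assumptions made for the sketch.

```python
import random

def humanize_swipe(start, end, n_points=20, jitter_px=2.0,
                   base_dt_ms=16.0, dt_jitter_ms=4.0, seed=None):
    """Turn a straight, robotic swipe into a noisier, human-looking one.

    Returns a list of (x, y, dt_ms) touch events: positions follow an
    eased (slow-start, slow-end) path with Gaussian spatial jitter, and
    inter-event delays vary instead of being perfectly uniform.
    """
    rng = random.Random(seed)
    (x0, y0), (x1, y1) = start, end
    events = []
    for i in range(n_points + 1):
        t = i / n_points
        # Smoothstep easing: humans accelerate, then decelerate mid-swipe,
        # rather than moving at constant velocity.
        s = t * t * (3 - 2 * t)
        x = x0 + (x1 - x0) * s + rng.gauss(0, jitter_px)
        y = y0 + (y1 - y0) * s + rng.gauss(0, jitter_px)
        # Inter-event timing also jitters; clamp to keep delays positive.
        dt = max(1.0, base_dt_ms + rng.gauss(0, dt_jitter_ms))
        events.append((x, y, dt))
    return events

# Example: a downward scroll from (100, 500) to (100, 200).
events = humanize_swipe((100, 500), (100, 200), seed=42)
```

Even a sketch like this changes the kinematic signature a detector sees (velocity profile, path curvature, event timing), which is the kind of low-level signal the paper's analysis says gives vanilla agents away; the data-driven matching the authors also explore would fit these parameters to real human traces instead of hand-picking them.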

Why it matters?

This work is important because it changes the focus from simply *if* an agent can do something to *how* it does it. If AI agents are going to be useful in the real world, they need to be able to operate without being constantly blocked. By making them more human-like, they can coexist more seamlessly with human users and avoid the constant battle against detection, paving the way for more reliable automation.

Abstract

The rise of autonomous GUI agents has triggered adversarial countermeasures from digital platforms, yet existing research prioritizes utility and robustness over the critical dimension of anti-detection. We argue that for agents to survive in human-centric ecosystems, they must evolve Humanization capabilities. We introduce the "Turing Test on Screen," formally modeling the interaction as a MinMax optimization problem between a detector and an agent aiming to minimize behavioral divergence. We then collect a new high-fidelity dataset of mobile touch dynamics, and our analysis shows that vanilla LMM-based agents are easily detectable due to unnatural kinematics. Consequently, we establish the Agent Humanization Benchmark (AHB) and detection metrics to quantify the trade-off between imitability and utility. Finally, we propose methods ranging from heuristic noise to data-driven behavioral matching, demonstrating that agents can achieve high imitability theoretically and empirically without sacrificing performance. This work shifts the paradigm from whether an agent can perform a task to how it performs it within a human-centric ecosystem, laying the groundwork for seamless coexistence in adversarial digital environments.
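The abstract does not reproduce the formal objective, but the described MinMax game between a detector and an agent admits a standard GAN-style reading. One plausible sketch, with all symbols assumed rather than taken from the paper, writes the agent policy as $\pi$, the detector as $D$, trajectories as $\tau$, and the human behavior distribution as $p_{\text{human}}$:

$$
\min_{\pi}\ \max_{D}\ \;
\mathbb{E}_{\tau \sim \pi}\big[\log D(\tau)\big]
+ \mathbb{E}_{\tau \sim p_{\text{human}}}\big[\log\big(1 - D(\tau)\big)\big]
\quad \text{s.t.} \quad \mathrm{Utility}(\pi) \ge \epsilon,
$$

where the inner maximization trains the best possible detector, the outer minimization drives the agent's behavioral divergence from humans toward zero, and the constraint captures the imitability-utility trade-off that the AHB metrics quantify.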