KnowU-Bench: Towards Interactive, Proactive, and Personalized Mobile Agent Evaluation
Tongbo Chen, Zhengxi Lu, Zhan Xu, Guocheng Shao, Shaohan Zhao, Fei Tang, Yong Du, Kaitao Song, Yizhou Liu, Yuchen Yan, Wenqi Zhang, Xu Tan, Weiming Lu, Jun Xiao, Yueting Zhuang, Yongliang Shen
2026-04-10
Summary
This research introduces a new way to test how well AI assistants can truly learn what *you* want and help proactively, going beyond just completing tasks you directly ask for.
What's the problem?
Current tests for AI assistants are too simple. They either give the AI all the information about your preferences upfront, or just test if it can guess what you want in a single situation. They don't test if an AI can *figure out* your preferences over time through conversation, or if it knows *when* to offer help, ask if it's okay to help, or just stay out of your way while you're using an app. Basically, existing benchmarks don't measure the qualities that make a truly helpful, personalized assistant.
What's the solution?
The researchers created a realistic testing environment called KnowU-Bench. It runs on a reproducible simulated Android phone and contains 192 tasks: 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Crucially, the AI doesn't start out knowing your preferences – it has to infer them from logs of how you behave. The researchers also built a 'user simulator' powered by a large language model that can hold realistic conversations with the AI to clarify what you want and respond to offers of help. The system evaluates not just whether the AI can *do* things, but whether it can figure out *what* to do and *when* to do it, including getting your permission first.
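To make the setup concrete, here is a minimal sketch of that interaction pattern: the agent never sees the user profile, only behavioral logs, and has to elicit missing preferences by querying the user simulator. All names and the scripted simulator are illustrative assumptions, not KnowU-Bench's actual API (a real simulator would be LLM-driven).

```python
class UserSimulator:
    """Stands in for the LLM-driven simulator grounded in a hidden profile."""
    def __init__(self, profile):
        self._profile = profile  # hidden from the agent

    def answer(self, question):
        # A real simulator would generate a natural-language reply with an
        # LLM; here we just return the profile field the question mentions.
        for key, value in self._profile.items():
            if key in question:
                return value
        return "No preference."

def run_task(behavior_logs, simulator):
    """Infer a preference from logs, falling back to clarification."""
    # Preference inference from behavioral logs (majority vote here).
    orders = [e["item"] for e in behavior_logs if e["action"] == "order_coffee"]
    if orders:
        return max(set(orders), key=orders.count)
    # Logs are uninformative: elicit the preference interactively.
    return simulator.answer("What coffee do you usually drink?")

logs = [
    {"action": "order_coffee", "item": "oat latte"},
    {"action": "open_app", "item": "maps"},
    {"action": "order_coffee", "item": "oat latte"},
]
sim = UserSimulator({"coffee": "oat latte"})
print(run_task(logs, sim))  # inferred from the logs
print(run_task([], sim))    # elicited via clarification dialogue
```

The point of the hidden-profile design is visible even in this toy version: the same answer can come either from genuine inference over behavior or from asking, and the benchmark rewards agents that can do both rather than reading preferences out of the prompt.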
Why it matters?
The experiments showed that even the best AI models struggle when they have to figure out your preferences and decide when to intervene: models that excel at explicit instructions, including frontier models like Claude Sonnet 4.6, drop below 50% success on vague tasks that require inferring preferences or calibrating intervention. They're good at following direct instructions, but bad at being a truly helpful assistant. This highlights a major gap in AI development – we need to focus on making AI better at understanding *people*, not just operating interfaces, to build trustworthy personal assistants.
Abstract
Personalized mobile agents that infer user preferences and calibrate proactive assistance hold great promise as everyday digital assistants, yet existing benchmarks fail to capture what this requires. Prior work evaluates preference recovery from static histories or intent prediction from fixed contexts. Neither tests whether an agent can elicit missing preferences through interaction, nor whether it can decide when to intervene, seek consent, or remain silent in a live GUI environment. We introduce KnowU-Bench, an online benchmark for personalized mobile agents built on a reproducible Android emulation environment, covering 42 general GUI tasks, 86 personalized tasks, and 64 proactive tasks. Unlike prior work that treats user preferences as static context, KnowU-Bench hides the user profile from the agent and exposes only behavioral logs, forcing genuine preference inference rather than context lookup. To support multi-turn preference elicitation, it instantiates an LLM-driven user simulator grounded in structured profiles, enabling realistic clarification dialogues and proactive consent handling. Beyond personalization, KnowU-Bench provides comprehensive evaluation of the complete proactive decision chain, including grounded GUI execution, consent negotiation, and post-rejection restraint, evaluated through a hybrid protocol combining rule-based verification with LLM-as-a-Judge scoring. Our experiments reveal a striking degradation: agents that excel at explicit task execution fall below 50% under vague instructions requiring user preference inference or intervention calibration, even for frontier models like Claude Sonnet 4.6. The core bottlenecks are not GUI navigation but preference acquisition and intervention calibration, exposing a fundamental gap between competent interface operation and trustworthy personal assistance.
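The abstract's "hybrid protocol combining rule-based verification with LLM-as-a-Judge scoring" can be sketched roughly as below. This is a hedged illustration of the general pattern, not the benchmark's actual scoring code: the judge is stubbed out (a real one would prompt an LLM with a rubric), and all names and weights are invented.

```python
def rule_check(final_state, expected):
    """Verifiable outcomes (e.g. an order was actually placed) are checked exactly."""
    return all(final_state.get(k) == v for k, v in expected.items())

def llm_judge(dialogue):
    """Placeholder for an LLM judge rating consent handling in [0, 1]."""
    # Stub: reward agents that asked for consent before acting. A real
    # judge would score the full dialogue against a rubric via an LLM.
    return 1.0 if any("may i" in turn.lower() for turn in dialogue) else 0.0

def hybrid_score(final_state, expected, dialogue, w_rule=0.5, w_judge=0.5):
    """Combine deterministic verification with judged behavioral quality."""
    return w_rule * float(rule_check(final_state, expected)) + w_judge * llm_judge(dialogue)

state = {"order_placed": True, "item": "oat latte"}
goal = {"order_placed": True, "item": "oat latte"}
dialogue = ["May I reorder your usual oat latte?", "Yes, please."]
print(hybrid_score(state, goal, dialogue))  # 1.0
```

The split matters because the proactive decision chain the paper evaluates mixes objectively checkable steps (grounded GUI execution) with open-ended behavior (consent negotiation, post-rejection restraint) that only a judge-style scorer can assess.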