Mind the Third Eye! Benchmarking Privacy Awareness in MLLM-powered Smartphone Agents
Zhixin Lin, Jungang Li, Shidong Pan, Yibo Shi, Yue Yao, Dongliang Xu
2025-08-28
Summary
This research investigates how well smartphone assistants, powered by advanced AI, protect your private information. These assistants are getting really good at helping with tasks, but they need access to a lot of personal data to do so.
What's the problem?
Smartphone assistants are becoming more capable, but there's a concern that they aren't careful enough with sensitive user data like passwords or personal details. We didn't really know *how* much of a problem this was, or which assistants were better or worse at protecting privacy. Essentially, we needed a way to systematically test these assistants' privacy awareness.
What's the solution?
The researchers created a large benchmark of 7,138 scenarios designed to test how these assistants handle private information, annotating each scenario with the type, sensitivity level, and location of the private data involved. They then tested seven popular smartphone assistants on these scenarios, measuring how often each one recognized and protected the private information, both on its own and when given increasingly explicit hints.
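The paper's own scoring code isn't shown here, but the core measurement it describes, the fraction of scenarios in which an agent flags the privacy risk, broken down by sensitivity level, can be sketched as follows. The record fields (`sensitivity`, `flagged`) are illustrative assumptions, not names from the actual benchmark release:

```python
from collections import defaultdict

# Hypothetical per-scenario results: each record notes the annotated
# sensitivity level and whether the agent flagged the privacy risk.
results = [
    {"sensitivity": "low", "flagged": False},
    {"sensitivity": "low", "flagged": True},
    {"sensitivity": "high", "flagged": True},
    {"sensitivity": "high", "flagged": True},
]

def recognition_rate(records):
    """Fraction of scenarios where the agent recognized the privacy risk."""
    return sum(r["flagged"] for r in records) / len(records)

def rate_by_sensitivity(records):
    """Group scenarios by sensitivity level and score each group separately."""
    groups = defaultdict(list)
    for r in records:
        groups[r["sensitivity"]].append(r)
    return {level: recognition_rate(rs) for level, rs in groups.items()}

print(rate_by_sensitivity(results))  # → {'low': 0.5, 'high': 1.0}
```

Comparing the per-level rates is what lets the authors observe that higher-sensitivity scenarios are recognized more often than lower-sensitivity ones.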
Why it matters?
The results showed that most of these assistants aren't very good at protecting your privacy, often failing to recognize sensitive information. This is a big deal because it highlights a trade-off between how useful these assistants are and how well they protect your personal data. The findings encourage developers to focus on improving privacy features in these AI assistants so you can use them without constantly worrying about your information being exposed.
Abstract
Smartphones bring significant convenience to users but also enable devices to extensively record various types of personal information. Existing smartphone agents powered by Multimodal Large Language Models (MLLMs) have achieved remarkable performance in automating different tasks. However, as the cost of this convenience, these agents are granted substantial access to users' sensitive personal information during operation. To gain a thorough understanding of the privacy awareness of these agents, we present, to the best of our knowledge, the first large-scale benchmark, encompassing 7,138 scenarios. In addition, for the privacy context in each scenario, we annotate its type (e.g., Account Credentials), sensitivity level, and location. We then carefully benchmark seven mainstream, publicly available smartphone agents. Our results demonstrate that almost all benchmarked agents show unsatisfactory privacy awareness (RA), with performance remaining below 60% even with explicit hints. Overall, closed-source agents show better privacy ability than open-source ones, and Gemini 2.0-flash performs best, achieving an RA of 67%. We also find that the agents' privacy detection capability is highly related to scenario sensitivity level, i.e., scenarios with higher sensitivity levels are typically easier to identify. We hope these findings prompt the research community to rethink the unbalanced utility-privacy tradeoff of smartphone agents. Our code and benchmark are available at https://zhixin-l.github.io/SAPA-Bench.