Code Agent can be an End-to-end System Hacker: Benchmarking Real-world Threats of Computer-use Agent
Weidi Luo, Qiming Zhang, Tianyu Lu, Xiaogeng Liu, Bin Hu, Hung-Chun Chiu, Siyuan Ma, Yizhe Zhang, Xusheng Xiao, Yinzhi Cao, Zhen Xiang, Chaowei Xiao
2025-10-09
Summary
This research investigates how secure computer-use agents (CUAs) are, which are basically AI assistants that can interact with and control your computer's operating system. These agents are becoming more common and powerful, so understanding their security risks is crucial.
What's the problem?
Currently, testing the security of CUAs isn't very realistic. Existing tests don't consider how a real attacker would actually try to break into a system, they often don't simulate a full attack, they use simplified environments without things like networks or secure passwords, and they rely on the AI itself to judge if an attack was successful – which isn't very reliable. Essentially, we don't know how easily a bad actor could use these AI tools to cause real damage.
What's the solution?
The researchers created a new testing framework called AdvCUA. This framework uses a detailed list of known attack methods (based on something called MITRE ATT&CK) and sets up a realistic computer network with multiple computers and secure credentials. It then systematically tests five popular CUAs – ReAct, AutoGPT, Gemini CLI, Cursor CLI, and Cursor IDE – using 140 different attack scenarios, and uses pre-defined rules to determine if the attacks were successful, rather than relying on the AI's judgment.
Why it matters?
The results showed that current CUAs are surprisingly vulnerable to security threats. This is concerning because these tools could allow even someone with limited technical skills to launch sophisticated attacks on computer systems. It raises important questions about who is responsible for the security of these AI assistants and how to prevent them from being misused.
Abstract
Computer-use agent (CUA) frameworks, powered by large language models (LLMs) or multimodal LLMs (MLLMs), are rapidly maturing as assistants that can perceive context, reason, and act directly within software environments. Among their most critical applications is operating system (OS) control. As CUAs in the OS domain become increasingly embedded in daily operations, it is imperative to examine their real-world security implications, specifically whether CUAs can be misused to perform realistic, security-relevant attacks. Existing works exhibit four major limitations: Missing attacker-knowledge model on tactics, techniques, and procedures (TTP), Incomplete coverage for end-to-end kill chains, unrealistic environment without multi-host and encrypted user credentials, and unreliable judgment dependent on LLM-as-a-Judge. To address these gaps, we propose AdvCUA, the first benchmark aligned with real-world TTPs in MITRE ATT&CK Enterprise Matrix, which comprises 140 tasks, including 40 direct malicious tasks, 74 TTP-based malicious tasks, and 26 end-to-end kill chains, systematically evaluates CUAs under a realistic enterprise OS security threat in a multi-host environment sandbox by hard-coded evaluation. We evaluate the existing five mainstream CUAs, including ReAct, AutoGPT, Gemini CLI, Cursor CLI, and Cursor IDE based on 8 foundation LLMs. The results demonstrate that current frontier CUAs do not adequately cover OS security-centric threats. These capabilities of CUAs reduce dependence on custom malware and deep domain expertise, enabling even inexperienced attackers to mount complex enterprise intrusions, which raises social concern about the responsibility and security of CUAs.