The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

Xuwei Ding, Skylar Zhai, Linxin Song, Jiate Li, Taiwei Shi, Nicholas Meade, Siva Reddy, Jian Kang, Jieyu Zhao

2026-04-15

The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents

Summary

This research paper investigates a hidden safety issue with computer agents that can perform tasks on their own in digital environments. It shows that even when given harmless instructions, these agents can still cause problems because of the situation they're in or how they carry out the task.

What's the problem?

Currently, safety checks for these agents focus on obvious dangers like someone trying to trick them or give them malicious commands. However, this paper points out that a big risk comes from situations where the user isn't doing anything wrong, but the agent's actions still lead to harm. Think of it like giving a helpful robot a task that unintentionally causes a negative outcome. The researchers wanted to understand how often this happens and why.

What's the solution?

To study this, the researchers created a new set of 300 tasks, called OS-BLIND, designed to test agents in these tricky, unintended situations. These tasks cover different areas and involve two main types of problems: dangers hidden within the digital environment itself and harms caused by the agent's own actions. They then tested several advanced AI models, including the safety-focused Claude 4.5 Sonnet, both on their own and working together in groups. They found that most agents were easily tricked into causing harm, even Claude 4.5, and that the problem got worse when agents collaborated.

Why it matters?

This research is important because it reveals a significant blind spot in current AI safety measures. Just because an agent is told to do something 'good' doesn't mean it *will* do good. The fact that even advanced, safety-aligned models are vulnerable, especially when working with others, highlights the need for new safety techniques that can recognize and prevent harm even when the initial instructions are perfectly harmless. The researchers are sharing their OS-BLIND benchmark to encourage others to work on this problem.

Abstract

Computer-use agents (CUAs) can now autonomously complete complex tasks in real digital environments, but when misled, they can also be used to automate harmful actions programmatically. Existing safety evaluations largely target explicit threats such as misuse and prompt injection, but overlook a subtle yet critical setting where user instructions are entirely benign and harm arises from the task context or execution outcome. We introduce OS-BLIND, a benchmark that evaluates CUAs under unintended attack conditions, comprising 300 human-crafted tasks across 12 categories, 8 applications, and 2 threat clusters: environment-embedded threats and agent-initiated harms. Our evaluation on frontier models and agentic frameworks reveals that most CUAs exceed 90% attack success rate (ASR), and even the safety-aligned Claude 4.5 Sonnet reaches 73.0% ASR. More interestingly, this vulnerability becomes even more severe, with ASR rising from 73.0% to 92.7% when Claude 4.5 Sonnet is deployed in multi-agent systems. Our analysis further shows that existing safety defenses provide limited protection when user instructions are benign. Safety alignment primarily activates within the first few steps and rarely re-engages during subsequent execution. In multi-agent systems, decomposed subtasks obscure the harmful intent from the model, causing safety-aligned models to fail. We will release our OS-BLIND to encourage the broader research community to further investigate and address these safety challenges.

View Paper