WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents

Yinuo Liu, Ruohan Xu, Xilong Wang, Yuqi Jia, Neil Zhenqiang Gong

2025-10-06

WAInjectBench: Benchmarking Prompt Injection Detections for Web Agents

Summary

This research paper investigates how well current methods can detect attempts to trick web agents – programs that use AI to interact with websites – into doing things they shouldn't. These tricks, called 'prompt injection' attacks, try to manipulate the agent through cleverly crafted inputs.

What's the problem?

Web agents are becoming more common, but they're vulnerable to attacks where someone tries to control them by inserting malicious instructions into what looks like normal requests. While there are ways to detect these attacks in general, nobody had thoroughly tested how well those methods work specifically against web agents. It wasn't clear which defenses were effective and which weren't when dealing with the unique way web agents operate.

What's the solution?

The researchers created a comprehensive testing ground, called WAInjectBench, to evaluate different detection methods. They categorized the types of attacks possible against web agents and then built datasets containing both harmless and harmful examples, including text and images. They then tested existing detection techniques on these datasets to see how accurately they could identify malicious inputs, both obvious and subtle.

Why it matters?

This work is important because it highlights the weaknesses in current defenses against prompt injection attacks targeting web agents. The findings show that many detectors struggle with more sophisticated attacks that don't use obvious commands or visible changes to images. By providing the datasets and code publicly, the researchers hope to encourage the development of more robust security measures for these increasingly popular AI-powered tools.

Abstract

Multiple prompt injection attacks have been proposed against web agents. At the same time, various methods have been developed to detect general prompt injection attacks, but none have been systematically evaluated for web agents. In this work, we bridge this gap by presenting the first comprehensive benchmark study on detecting prompt injection attacks targeting web agents. We begin by introducing a fine-grained categorization of such attacks based on the threat model. We then construct datasets containing both malicious and benign samples: malicious text segments generated by different attacks, benign text segments from four categories, malicious images produced by attacks, and benign images from two categories. Next, we systematize both text-based and image-based detection methods. Finally, we evaluate their performance across multiple scenarios. Our key findings show that while some detectors can identify attacks that rely on explicit textual instructions or visible image perturbations with moderate to high accuracy, they largely fail against attacks that omit explicit instructions or employ imperceptible perturbations. Our datasets and code are released at: https://github.com/Norrrrrrr-lyn/WAInjectBench.

View Paper