FocusAgent: Simple Yet Effective Ways of Trimming the Large Context of Web Agents

Imene Kerboua, Sahar Omidi Shayegan, Megh Thakkar, Xing Han Lù, Léo Boisvert, Massimo Caccia, Jérémy Espinas, Alexandre Aussem, Véronique Eglin, Alexandre Lacoste

2025-10-06

Summary

This paper explores how to make web-browsing computer programs, called agents, work better when they need to understand long and complex web pages.

What's the problem?

These agents rely on powerful AI models, but those models can only process a limited amount of text at once. Long web pages easily exceed those limits, making processing slow and expensive. Feeding an agent an entire web page also creates security holes: attackers can hide malicious instructions in the page content to trick the agent into doing things it shouldn't, a technique known as 'prompt injection'.

What's the solution?

The researchers created a system called FocusAgent. It uses a smaller AI model to quickly identify and pull out only the *most important* lines of text from the web page, focusing on what's relevant to the task the agent is trying to complete. This is done by looking at the underlying structure of the webpage, called the accessibility tree. By only giving the main AI model the essential information, FocusAgent works faster, costs less, and is more secure.
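The idea above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: FocusAgent uses a lightweight LLM retriever to pick relevant lines, whereas here a simple keyword-overlap score stands in for that model so the sketch runs on its own. The function names and the toy accessibility-tree lines are assumptions made for the example.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def relevance_score(line: str, goal: str) -> float:
    """Stand-in for the LLM retriever: score a line by word overlap with the goal."""
    line_words = tokens(line)
    if not line_words:
        return 0.0
    return len(tokens(goal) & line_words) / len(line_words)

def prune_observation(axtree: str, goal: str, keep_ratio: float = 0.5) -> str:
    """Keep only the highest-scoring lines, preserving their original order."""
    lines = axtree.splitlines()
    k = max(1, int(len(lines) * keep_ratio))
    ranked = sorted(range(len(lines)),
                    key=lambda i: relevance_score(lines[i], goal),
                    reverse=True)
    keep = set(ranked[:k])
    return "\n".join(line for i, line in enumerate(lines) if i in keep)

# Toy accessibility-tree observation (hypothetical lines for illustration)
axtree = "\n".join([
    "button 'Submit order'",
    "link 'Privacy policy'",
    "textbox 'Search products'",
    "banner 'Subscribe to our newsletter!'",
])
pruned = prune_observation(axtree, goal="search for products", keep_ratio=0.5)
print(pruned)
```

Replacing `relevance_score` with a call to a small LLM that judges each line against the task goal recovers the paper's setup; pruning also drops injected content (like the banner line here) before the main model ever sees it.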

Why it matters?

This work shows a practical way to build web agents that are efficient, effective, and safe. By smartly selecting what information the AI sees, we can overcome the limitations of current AI models and reduce the risk of security breaches, making these agents more reliable for everyday tasks.

Abstract

Web agents powered by large language models (LLMs) must process lengthy web page observations to complete user goals; these pages often exceed tens of thousands of tokens. This saturates context limits and increases computational cost; moreover, processing full pages exposes agents to security risks such as prompt injection. Existing pruning strategies either discard relevant content or retain irrelevant context, leading to suboptimal action prediction. We introduce FocusAgent, a simple yet effective approach that leverages a lightweight LLM retriever to extract the most relevant lines from accessibility tree (AxTree) observations, guided by task goals. By pruning noisy and irrelevant content, FocusAgent enables efficient reasoning while reducing vulnerability to injection attacks. Experiments on WorkArena and WebArena benchmarks show that FocusAgent matches the performance of strong baselines, while reducing observation size by over 50%. Furthermore, a variant of FocusAgent significantly reduces the success rate of prompt-injection attacks, including banner and pop-up attacks, while maintaining task success performance in attack-free settings. Our results highlight that targeted LLM-based retrieval is a practical and robust strategy for building web agents that are efficient, effective, and secure.