BrowserAgent: Building Web Agents with Human-Inspired Web Browsing Actions

Zhengbo Zhang, Zhiheng Lyu, Junhao Gong, Hongzhu Yi, Xinming Wang, Yuxuan Zhou, Jiabing Yang, Ping Nie, Yan Huang, Wenhu Chen

2025-10-14

Summary

This paper introduces BrowserAgent, a new way to get large language models (LLMs) to solve problems on the internet by actually *using* a web browser, similar to how a person would.

What's the problem?

Current LLM-based web agents often rely on extra tools that convert webpages into static text, discarding interactive elements like buttons, forms, and scrollable regions. This is clunky and doesn't reflect how humans naturally browse the web: we click, scroll, and type to find information. Existing systems also need large amounts of training data to work well.

What's the solution?

The researchers created BrowserAgent, which controls a real web browser (via the Playwright automation tool) through a set of predefined actions like clicking, scrolling, and typing. It's trained in two stages: first it learns from example trajectories (supervised fine-tuning), and then it is refined by keeping only its own correct attempts as further training data (rejection fine-tuning). The team also added an explicit 'memory' so the model can record key conclusions as it navigates across multiple webpages to answer complex questions. Importantly, it achieves strong results with less training data than comparable systems.
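The explicit memory mechanism could be sketched as a simple note store whose contents are rendered back into the next prompt. This is an illustrative assumption about the design, not the paper's actual implementation; the class and method names here are invented for the sketch.

```python
# Hypothetical sketch of an explicit memory for a web agent: the agent records
# short conclusions at each step, and the store is rendered into later prompts.
# Names (ExplicitMemory, as_prompt_block) are assumptions for illustration.

class ExplicitMemory:
    """Stores short conclusions the agent extracts across browsing steps."""

    def __init__(self) -> None:
        self._notes: list[str] = []

    def add(self, conclusion: str) -> None:
        # Keep each distinct conclusion once, ignoring empty strings.
        conclusion = conclusion.strip()
        if conclusion and conclusion not in self._notes:
            self._notes.append(conclusion)

    def as_prompt_block(self) -> str:
        """Render the memory as text to prepend to the next LLM prompt."""
        if not self._notes:
            return "Memory: (empty)"
        return "Memory:\n" + "\n".join(f"- {n}" for n in self._notes)


memory = ExplicitMemory()
memory.add("Page 1: the company was founded in 1998.")
memory.add("Page 3: its founder later started a second venture.")
print(memory.as_prompt_block())
```

For a multi-hop question, each intermediate answer would be added as a note, so the final reasoning step sees all earlier conclusions even after many page navigations.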

Why it matters?

BrowserAgent represents a step forward in creating more capable and realistic web-based AI agents. Because it interacts with webpages directly, it can handle more complex tasks and requires less data to learn. This means we're closer to having AI that can autonomously research and solve problems online, much like a human assistant.

Abstract

Efficiently solving real-world problems with LLMs increasingly hinges on their ability to interact with dynamic web environments and autonomously acquire external information. While recent research like Search-R1 and WebDancer demonstrates strong performance in solving web tasks, they heavily rely on additional tools to convert the interactive web environment into static text content. This is in contrast to human browsing behaviors, which involve diverse interactions with the browser, such as scrolling, clicking, and typing. In this paper, we propose BrowserAgent, a more interactive agent that solves complex tasks through human-inspired browser actions. BrowserAgent operates directly on raw web pages via Playwright through a set of predefined browser actions. We adopt a two-stage training (Supervised Fine-Tuning (SFT) and Rejection Fine-Tuning (RFT)) to improve the model's generalization abilities. Despite using significantly less training data than Search-R1, BrowserAgent achieves more competitive results across different Open-QA tasks. Additionally, we introduce an explicit memory mechanism to store key conclusions across steps, further enhancing the model's reasoning capabilities for long-horizon tasks. Notably, BrowserAgent-7B can achieve around 20% improvement over Search-R1 on multi-hop QA tasks like HotpotQA, 2Wiki, and Bamboogle. These results indicate that BrowserAgent can serve as a more advanced framework for more interactive and scalable web agents.
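A predefined action set executed via Playwright might be wired up roughly as below: the model emits an action string, which is parsed and dispatched to browser calls. The action names (click, type, goto, scroll) mirror the human-inspired actions the abstract describes, but the string format, parser, and dispatch are assumptions made for this sketch, not the paper's interface.

```python
# Hypothetical dispatch from model-emitted action strings to browser calls.
# parse_action/execute and the 'name(args)' syntax are illustrative assumptions.
import re

ACTION_PATTERN = re.compile(r"^(?P<name>\w+)\((?P<args>.*)\)$")

def parse_action(text: str) -> tuple[str, list[str]]:
    """Parse e.g. 'click("#search-btn")' into ('click', ['#search-btn'])."""
    m = ACTION_PATTERN.match(text.strip())
    if m is None:
        raise ValueError(f"unparseable action: {text!r}")
    args = [a.strip().strip("\"'") for a in m.group("args").split(",") if a.strip()]
    return m.group("name"), args

def execute(page, action_text: str) -> None:
    """Map a parsed action onto a Playwright-style page object."""
    name, args = parse_action(action_text)
    if name == "click":
        page.click(args[0])                # click an element by CSS selector
    elif name == "type":
        page.fill(args[0], args[1])        # type text into an input field
    elif name == "goto":
        page.goto(args[0])                 # navigate to a URL
    elif name == "scroll":
        page.mouse.wheel(0, int(args[0]))  # scroll vertically by N pixels
    else:
        raise ValueError(f"unknown action: {name}")
```

In a real run, `page` would be a Playwright `Page` object (`page.click`, `page.fill`, `page.goto`, and `page.mouse.wheel` are actual Playwright methods); here any object with those methods will do, which also makes the dispatch easy to test without launching a browser.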