
The BrowserGym Ecosystem for Web Agent Research

Thibault Le Sellier De Chezelles, Maxime Gasse, Alexandre Drouin, Massimo Caccia, Léo Boisvert, Megh Thakkar, Tom Marty, Rim Assouel, Sahar Omidi Shayegan, Lawrence Keunho Jang, Xing Han Lù, Ori Yoran, Dehan Kong, Frank F. Xu, Siva Reddy, Quentin Cappart, Graham Neubig, Ruslan Salakhutdinov, Nicolas Chapados, Alexandre Lacoste

2024-12-12


Summary

This paper introduces the BrowserGym ecosystem, a framework designed to standardize how researchers evaluate and compare web agents that use browser automation and large language models (LLMs).

What's the problem?

Evaluating web agents has been difficult because existing benchmarks are fragmented and use inconsistent evaluation methodologies. Different benchmarks measure agents in different ways, which makes it hard to compare results or reproduce experiments. As a result, researchers struggle to develop reliable and effective web agents that can perform tasks on the internet.

What's the solution?

The authors propose the BrowserGym ecosystem, which provides a standardized, gym-like environment for testing web agents, with well-defined observation and action spaces. It offers a unified interface across benchmarks, allowing researchers to integrate new tests while ensuring consistent evaluation. Additionally, they introduce AgentLab, a complementary tool for building, testing, and analyzing these agents. Together, the two components simplify web agent development and enable more reliable comparisons across models; a minimal usage sketch appears below.
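To give a sense of what a gym-like interface for web agents looks like in practice, here is a minimal sketch of driving a BrowserGym environment through the standard Gymnasium API. The specific task ID, keyword arguments, and the hard-coded action string are illustrative assumptions for this sketch, not exact details confirmed by the paper.

```python
# Minimal sketch of interacting with a BrowserGym environment via the
# standard Gymnasium interface. Task ID, kwargs, and the action string
# below are illustrative assumptions, not guaranteed API details.
import gymnasium as gym
import browsergym.core  # noqa: F401  (importing registers BrowserGym tasks with Gymnasium)

# Create an open-ended browsing task starting from an example page
# (assumed task ID and keyword argument).
env = gym.make(
    "browsergym/openended",
    task_kwargs={"start_url": "https://www.example.com"},
)

obs, info = env.reset()
# The observation describes the current page state (e.g. accessibility tree,
# screenshot, URL); the action is expressed in BrowserGym's high-level action
# language. The element id "12" here is hypothetical.
action = 'click("12")'
obs, reward, terminated, truncated, info = env.step(action)

env.close()
```

In this setup, an LLM-based agent would sit in the loop between `env.reset()`/`env.step()` calls, mapping each observation to the next action string; standardizing those observation and action spaces is what lets different benchmarks and agents be compared consistently.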

Why it matters?

This research is important because it addresses the challenges faced by developers of web agents, ultimately leading to better and more efficient AI systems. By creating a standardized testing environment, BrowserGym can accelerate innovation in web automation technologies, making it easier for researchers to create capable agents that can handle real-world tasks effectively.

Abstract

The BrowserGym ecosystem addresses the growing need for efficient evaluation and benchmarking of web agents, particularly those leveraging automation and Large Language Models (LLMs) for web interaction tasks. Many existing benchmarks suffer from fragmentation and inconsistent evaluation methodologies, making it challenging to achieve reliable comparisons and reproducible results. BrowserGym aims to solve this by providing a unified, gym-like environment with well-defined observation and action spaces, facilitating standardized evaluation across diverse benchmarks. Combined with AgentLab, a complementary framework that aids in agent creation, testing, and analysis, BrowserGym offers flexibility for integrating new benchmarks while ensuring consistent evaluation and comprehensive experiment management. This standardized approach seeks to reduce the time and complexity of developing web agents, supporting more reliable comparisons and facilitating in-depth analysis of agent behaviors, and could result in more adaptable, capable agents, ultimately accelerating innovation in LLM-driven automation. As supporting evidence, we conduct the first large-scale, multi-benchmark web agent experiment and compare the performance of 6 state-of-the-art LLMs across all benchmarks currently available in BrowserGym. Among other findings, our results highlight a large discrepancy between OpenAI's and Anthropic's latest models, with Claude-3.5-Sonnet leading the way on almost all benchmarks, except on vision-related tasks where GPT-4o is superior. Despite these advancements, our results emphasize that building robust and efficient web agents remains a significant challenge, due to the inherent complexity of real-world web environments and the limitations of current models.