
AssistantBench: Can Web Agents Solve Realistic and Time-Consuming Tasks?

Ori Yoran, Samuel Joseph Amouyal, Chaitanya Malaviya, Ben Bogin, Ofir Press, Jonathan Berant

2024-07-23


Summary

This paper presents AssistantBench, a new benchmark designed to test how well web agents can perform realistic, time-consuming tasks on the open web. It evaluates how capable language agents built on language models (LMs) are at handling complex web interactions that require autonomous browsing.

What's the problem?

While language agents have shown promise on a variety of tasks, most evaluations focus on simple scenarios or straightforward questions that can be completed in a few steps. This leaves a gap in understanding how these agents perform in more realistic situations that require navigating multiple web pages and completing tasks that take time, such as monitoring real-estate markets or finding relevant local businesses. Existing benchmarks do not adequately measure these capabilities, making it hard to assess how effective current web agents really are.

What's the solution?

To address this gap, the authors created AssistantBench, a benchmark of 214 realistic tasks spanning different scenarios and domains that can be evaluated automatically. Solving them requires an agent to autonomously browse the web, gather information from multiple pages, and return a final answer. The benchmark exposes significant limitations in current systems: no model scores above 25 points, closed-book LMs tend to hallucinate facts, and state-of-the-art web agents score near zero. The authors also introduce a new web agent, SeePlanAct (SPA), which significantly outperforms previous agents, and an ensemble of SPA with closed-book models achieves the best overall performance.
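SeePlanAct is only described at a high level here, so the following is a minimal sketch of what a see-plan-act style browsing loop could look like, not the authors' implementation. The helper functions call_llm, search_web, and fetch_page are hypothetical placeholders for an LM API and a browsing backend, and the SEARCH/GOTO/NOTE/ANSWER action names are invented for illustration rather than taken from the paper.

```python
# Illustrative see-plan-act loop (a sketch, NOT the authors' SPA implementation).
# call_llm, search_web, and fetch_page are hypothetical placeholders.

def call_llm(prompt: str) -> str:
    """Placeholder for a language-model call (e.g., an API client)."""
    raise NotImplementedError

def search_web(query: str) -> str:
    """Placeholder: return search-result text for a query."""
    raise NotImplementedError

def fetch_page(url: str) -> str:
    """Placeholder: return the visible text of a web page."""
    raise NotImplementedError

def run_agent(task: str, max_steps: int = 10) -> str:
    notes = []                       # facts the agent decides to remember (its "plan" state)
    observation = "No page loaded yet."
    for _ in range(max_steps):
        prompt = (
            f"Task: {task}\n"
            f"Notes so far: {notes}\n"
            f"Current observation: {observation[:2000]}\n"
            "Respond with one line: SEARCH <query>, GOTO <url>, "
            "NOTE <fact to remember>, or ANSWER <final answer>."
        )
        action = call_llm(prompt).strip()              # the LM plans and picks one action
        if action.startswith("ANSWER"):
            return action[len("ANSWER"):].strip()      # commit to a final answer
        elif action.startswith("SEARCH"):
            observation = search_web(action[len("SEARCH"):].strip())
        elif action.startswith("GOTO"):
            observation = fetch_page(action[len("GOTO"):].strip())
        elif action.startswith("NOTE"):
            notes.append(action[len("NOTE"):].strip())
    return ""                                          # no answer within the step budget
```

The key idea the sketch tries to capture is that the agent alternates between observing a page, updating its notes, and choosing the next browsing action, rather than answering in a single shot.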

Why it matters?

This research is important because it highlights the challenges that current web agents face when dealing with complex tasks on the internet. By providing a comprehensive benchmark like AssistantBench, researchers can better understand the strengths and weaknesses of these systems. This can lead to improvements in AI technology, making web agents more effective at performing real-world tasks, which is crucial for applications in customer service, online shopping, and information retrieval.

Abstract

Language agents, built on top of language models (LMs), are systems that can interact with complex environments, such as the open web. In this work, we examine whether such agents can perform realistic and time-consuming tasks on the web, e.g., monitoring real-estate markets or locating relevant nearby businesses. We introduce AssistantBench, a challenging new benchmark consisting of 214 realistic tasks that can be automatically evaluated, covering different scenarios and domains. We find that AssistantBench exposes the limitations of current systems, including language models and retrieval-augmented language models, as no model reaches an accuracy of more than 25 points. While closed-book LMs perform well, they exhibit low precision since they tend to hallucinate facts. State-of-the-art web agents reach a score of near zero. Additionally, we introduce SeePlanAct (SPA), a new web agent that significantly outperforms previous agents, and an ensemble of SPA and closed-book models reaches the best overall performance. Moreover, we analyze failures of current systems and highlight that web navigation remains a major challenge.
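For intuition, the ensemble of SPA and closed-book models mentioned in the abstract can be thought of as an answer-level combination: use the web agent's answer when it produces one, and fall back to the closed-book model otherwise. The sketch below illustrates that assumption; it is not necessarily the paper's exact procedure.

```python
# Illustrative answer-level ensemble of a web agent and a closed-book LM.
# A sketch of one plausible combination, not the paper's exact method.

def ensemble_answer(agent_answer: str, closed_book_answer: str) -> str:
    """Prefer the web agent's browsing-grounded answer; fall back to the
    closed-book model when the agent abstains or returns nothing."""
    if agent_answer and agent_answer.strip():
        return agent_answer.strip()
    return closed_book_answer.strip()
```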