WebWalker: Benchmarking LLMs in Web Traversal

Jialong Wu, Wenbiao Yin, Yong Jiang, Zhenglin Wang, Zekun Xi, Runnan Fang, Deyu Zhou, Pengjun Xie, Fei Huang

2025-01-14

Summary

This paper introduces a new way to make AI better at finding information on websites. The researchers built WebWalker, a tool that helps AI navigate web pages the way a person would when looking for answers to complex questions.

What's the problem?

Current AI systems are good at answering questions, but they sometimes struggle when the information is spread across multiple web pages or buried deep within a website. Regular search engines often only find surface-level information, which isn't always enough for tricky questions that need more detailed answers.

What's the solution?

The researchers made two contributions. First, they built WebWalkerQA, a benchmark that tests how well AI can search through a website and its subpages. Second, they created WebWalker, a multi-agent system that browses a site much like a person would. One part explores web pages by following links, and another part, the critic, decides whether enough information has been found to answer the question. This helps the AI dig deeper into websites to find better answers.
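To make the explore-and-judge idea concrete, here is a minimal sketch in Python. It is not the authors' implementation: in WebWalker both roles are played by LLM agents working on real websites, while this toy version swaps in simple keyword matching and a small in-memory site. The names `Page`, `SITE`, `explorer`, `critic`, and `walk` are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Page:
    text: str
    links: dict = field(default_factory=dict)  # link label -> target path


# Hypothetical toy site standing in for a real website (assumption).
SITE = {
    "/": Page("Welcome to the conference site.",
              {"Venue": "/venue", "Program": "/program"}),
    "/venue": Page("The venue is the convention center."),
    "/program": Page("The program includes workshops.",
                     {"Workshops": "/program/workshops"}),
    "/program/workshops": Page("The workshop paper deadline is May 1."),
}


def explorer(page, keywords, visited):
    """Stand-in for the exploring agent: order unvisited links by how many
    query keywords appear in the link label (keyword stub, not an LLM)."""
    def score(label):
        return sum(k in label.lower() for k in keywords)

    candidates = [(label, url) for label, url in page.links.items()
                  if url not in visited]
    # Ascending order, so the best-scoring link ends up on top of the stack.
    ranked = sorted(candidates, key=lambda c: score(c[0]))
    return [url for _, url in ranked]


def critic(memory, keywords):
    """Stand-in for the critic agent: accept the first remembered page text
    that mentions every query keyword, otherwise ask to keep exploring."""
    for text in memory:
        if all(k in text.lower() for k in keywords):
            return text
    return None


def walk(root, keywords, max_steps=10):
    """Alternate between exploring a page and asking the critic whether
    the collected memory already answers the query."""
    stack, visited, memory, path = [root], set(), [], []
    for _ in range(max_steps):
        if not stack:
            break
        url = stack.pop()
        if url in visited:
            continue
        visited.add(url)
        path.append(url)
        page = SITE[url]
        memory.append(page.text)
        answer = critic(memory, keywords)
        if answer is not None:
            return answer, path
        stack.extend(explorer(page, keywords, visited))
    return None, path


if __name__ == "__main__":
    answer, path = walk("/", ["workshop", "deadline"])
    print(answer)
    print(path)
```

The point of the sketch is the division of labor: the explorer only decides where to click next, while the critic only decides when to stop. The answer here sits two clicks below the home page, which is the kind of buried information that a search over surface-level pages would miss.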

Why does it matter?

This matters because it could make AI much better at helping people find information online, especially for complicated questions. It could lead to smarter search engines and virtual assistants that can understand and navigate websites more like humans do. This could be super helpful for research, learning, or any situation where you need to find detailed information from multiple web pages.

Abstract

Retrieval-augmented generation (RAG) demonstrates remarkable performance across tasks in open-domain question-answering. However, traditional search engines may retrieve shallow content, limiting the ability of LLMs to handle complex, multi-layered information. To address this, we introduce WebWalkerQA, a benchmark designed to assess the ability of LLMs to perform web traversal. It evaluates the capacity of LLMs to traverse a website's subpages to extract high-quality data systematically. We propose WebWalker, a multi-agent framework that mimics human-like web navigation through an explore-critic paradigm. Extensive experimental results show that WebWalkerQA is challenging and demonstrate the effectiveness of RAG combined with WebWalker, through horizontal and vertical integration in real-world scenarios.
