WebGames: Challenging General-Purpose Web-Browsing AI Agents
George Thomas, Alex J. Chan, Jikun Kang, Wenqi Wu, Filippos Christianos, Fraser Greenlee, Andy Toulis, Marvin Purtorab
2025-02-26
Summary
This paper introduces WebGames, a new suite of challenges designed to test how well AI agents can browse and interact with websites compared to humans.
What's the problem?
Current AI systems struggle with website tasks that are easy for humans, like clicking buttons or dragging items. We also lack a good way to measure how well AI performs these tasks compared to people.
What's the solution?
The researchers created WebGames, a collection of over 50 web-based challenges that are easy for humans but hard for AI. These challenges test different skills, including basic clicking, complex interactions, problem-solving, and even playing simple games. The researchers tested several top AI models on these challenges and compared their performance to humans.
Why does it matter?
This matters because, as AI becomes more common in our daily lives, we need to know how well it can handle tasks on websites. WebGames shows that even the best AI still lags far behind humans in web-browsing skills, with the top AI succeeding only 43.1% of the time compared to humans' 95.7%. This helps researchers pinpoint which areas of web interaction AI needs to improve on, which could lead to better AI assistants for tasks like online shopping or using web applications.
Abstract
We introduce WebGames, a comprehensive benchmark suite designed to evaluate general-purpose web-browsing AI agents through a collection of 50+ interactive challenges. These challenges are specifically crafted to be straightforward for humans while systematically testing the limitations of current AI systems across fundamental browser interactions, advanced input processing, cognitive tasks, workflow automation, and interactive entertainment. Our framework eliminates external dependencies through a hermetic testing environment, ensuring reproducible evaluation with verifiable ground-truth solutions. We evaluate leading vision-language models including GPT-4o, Claude Computer-Use, Gemini-1.5-Pro, and Qwen2-VL against human performance. Results reveal a substantial capability gap, with the best AI system achieving only 43.1% success rate compared to human performance of 95.7%, highlighting fundamental limitations in current AI systems' ability to handle common web interaction patterns that humans find intuitive. The benchmark is publicly available at webgames.convergence.ai, offering a lightweight, client-side implementation that facilitates rapid evaluation cycles. Through its modular architecture and standardized challenge specifications, WebGames provides a robust foundation for measuring progress in development of more capable web-browsing agents.