ClawBench: Can AI Agents Complete Everyday Online Tasks?

Yuxuan Zhang, Yubo Wang, Yipeng Zhu, Penghui Du, Junwen Miao, Xuan Lu, Wendong Xu, Yunzhuo Hao, Songcheng Cai, Xiaochen Wang, Huaisong Zhang, Xian Wu, Yi Lu, Minyi Lei, Kai Zou, Huifeng Yin, Ping Nie, Liang Chen, Dongfu Jiang, Wenhu Chen, Kelsey R. Allen

2026-04-10

Summary

This paper introduces a new way to test how well AI agents can handle everyday tasks online, like making purchases or applying for jobs.

What's the problem?

Current AI benchmarks often test agents in simplified, controlled environments that don't reflect the real internet. This means we don't really know how well they'll perform when faced with the complexities of actual websites and the many steps involved in common online activities. Existing tests aren't challenging enough to see if AI can truly act as a helpful assistant in our daily lives.

What's the solution?

The researchers created ClawBench, a suite of 153 realistic tasks spanning 144 different websites. The tasks require an agent to extract relevant information from user-provided documents, navigate complicated multi-step website workflows, and accurately fill out detailed forms. Importantly, ClawBench runs the AI on *live* websites, but includes a safety layer that intercepts and blocks the final submission request, so actions with real-world consequences, like making a purchase, never actually go through. The researchers then evaluated seven leading AI models on these tasks.
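To make the safety mechanism concrete, here is a minimal sketch of a request-interception layer. The paper does not publish its implementation, so the class, the state-changing-method check, and the URL keyword heuristic below are all illustrative assumptions, not ClawBench's actual code:

```python
# Hypothetical interception layer: forwards ordinary traffic, but captures
# and blocks a request that looks like a final, irreversible submission.
# The keyword list and heuristics are illustrative, not from the paper.

FINAL_SUBMISSION_HINTS = ("checkout", "submit", "apply", "purchase", "book")


def should_block(method: str, url: str) -> bool:
    """Block only state-changing requests whose URL suggests a final submission."""
    if method.upper() not in {"POST", "PUT", "PATCH", "DELETE"}:
        return False  # read-only requests (e.g. GET) are always allowed
    return any(hint in url.lower() for hint in FINAL_SUBMISSION_HINTS)


class InterceptingProxy:
    """Sits between agent and live site; records the blocked request for grading."""

    def __init__(self):
        self.captured = None  # (method, url, body) of the blocked submission

    def handle(self, method: str, url: str, body: bytes):
        if self.captured is None and should_block(method, url):
            self.captured = (method, url, body)  # evaluator can grade this payload
            return ("blocked", None)             # never reaches the live site
        return ("forwarded", body)
```

Under this sketch, the agent browses and fills forms normally (those requests are forwarded), and only the final submission is caught, which also gives the evaluator a structured payload to grade against the task's ground truth.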

Why it matters?

The results show that even the most advanced AI models struggle with these everyday tasks: the best performer, Claude Sonnet 4.6, completes only 33.3% of them. This work highlights how much further AI agents must develop before they can serve as reliable general-purpose assistants for routine online activities.

Abstract

AI agents may be able to automate your inbox, but can they automate other routine aspects of your life? Everyday online tasks offer a realistic yet unsolved testbed for evaluating the next generation of AI agents. To this end, we introduce ClawBench, an evaluation framework of 153 simple tasks that people need to accomplish regularly in their lives and work, spanning 144 live platforms across 15 categories, from completing purchases and booking appointments to submitting job applications. These tasks require demanding capabilities beyond existing benchmarks, such as obtaining relevant information from user-provided documents, navigating multi-step workflows across diverse platforms, and performing write-heavy operations like filling in many detailed forms correctly. Unlike existing benchmarks that evaluate agents in offline sandboxes with static pages, ClawBench operates on production websites, preserving the full complexity, dynamic nature, and challenges of real-world web interaction. A lightweight interception layer captures and blocks only the final submission request, ensuring safe evaluation without real-world side effects. Our evaluations of 7 frontier models show that both proprietary and open-source models can complete only a small portion of these tasks. For example, Claude Sonnet 4.6 achieves only 33.3%. Progress on ClawBench brings us closer to AI agents that can function as reliable general-purpose assistants.