Claw-Eval-Live: A Live Agent Benchmark for Evolving Real-World Workflows
Chenxin Li, Zhengyang Tang, Huangxin Lin, Yunlong Lin, Shijue Huang, Shengyuan Liu, Bowen Ye, Rang Li, Lei Li, Benyou Wang, Yixuan Yuan
2026-05-01
Summary
This paper introduces a new way to test AI agents that are designed to handle complex tasks involving multiple software tools and services, like a virtual assistant managing your work life.
What's the problem?
Currently, testing these AI agents is tricky for two reasons. Most benchmarks use a fixed set of tasks, so they don't reflect how real-world needs change over time. They also grade mainly the final answer rather than the steps taken to get there, so it's hard to know whether the agent actually *did* the work or just gave a good-sounding response.
What's the solution?
The researchers created 'Claw-Eval-Live,' a testing system that regularly refreshes its tasks based on what people are actually asking AI agents to do. The current release draws on the Top-500 'skills' from ClawHub and turns them into realistic scenarios with fixed tools, services, and workspaces. Importantly, it doesn't just check the final answer: it records *everything* the agent does, including every step, every log, and any changes made to files, and uses that evidence to verify the work. Automatic checks are used whenever the evidence allows, and a more expensive AI judge is brought in only for aspects that require understanding meaning.
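A minimal sketch of what this two-stage grading could look like, assuming hypothetical names (`Evidence`, `grade_task`, and the example check); the benchmark's actual grading interfaces are not shown in this summary.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Evidence:
    """Hypothetical bundle of what gets recorded for one agent run."""
    execution_trace: list = field(default_factory=list)   # every step/tool call the agent made
    audit_logs: list = field(default_factory=list)        # service-side audit entries
    service_state: dict = field(default_factory=dict)     # final state of the business services
    workspace_files: dict = field(default_factory=dict)   # path -> contents after the run


def grade_task(
    evidence: Evidence,
    deterministic_checks: list,
    semantic_judge: Optional[Callable[[Evidence], bool]] = None,
) -> bool:
    """Run cheap deterministic checks first; call an LLM judge only for semantic dimensions."""
    if not all(check(evidence) for check in deterministic_checks):
        return False
    if semantic_judge is not None:
        # e.g. "is the drafted reply polite and on-topic?" -- needs a model, not string equality
        return semantic_judge(evidence)
    return True


# Toy usage: a deterministic check that the recorded service state shows an invoice was approved.
run = Evidence(service_state={"invoice_42_status": "approved"})
checks = [lambda e: e.service_state.get("invoice_42_status") == "approved"]
print(grade_task(run, checks))  # True
```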
Why it matters?
This research shows that even the best AI agents still struggle with complex, real-world tasks: the strongest model passes only 66.7% of them, and none reaches 70%. The new testing method highlights where agents are weakest, particularly in areas like human resources, management, and workflows that span multiple systems. It also shows that an agent's overall score isn't enough on its own; you need to examine how it performs on individual tasks to understand its real capabilities and where improvements are needed.
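As a toy illustration of why task-level analysis matters (made-up numbers, not the paper's results): two models can have identical overall pass rates while failing on different tasks, which only a per-task comparison exposes.

```python
# Hypothetical per-task outcomes for two models; not the paper's data.
model_a = {"hr_onboarding": False, "expense_report": True, "repo_repair": True, "calendar_sync": True}
model_b = {"hr_onboarding": True, "expense_report": False, "repo_repair": True, "calendar_sync": True}

rate_a = sum(model_a.values()) / len(model_a)  # 0.75
rate_b = sum(model_b.values()) / len(model_b)  # 0.75

# Same leaderboard score, different failure profile -- visible only at the task level.
disagreements = [task for task in model_a if model_a[task] != model_b[task]]
print(rate_a, rate_b, disagreements)  # 0.75 0.75 ['hr_onboarding', 'expense_report']
```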
Abstract
LLM agents are expected to complete end-to-end units of work across software tools, business services, and local workspaces. Yet many agent benchmarks freeze a curated task set at release time and grade mainly the final response, making it difficult to evaluate agents against evolving workflow demand or verify whether a task was executed. We introduce Claw-Eval-Live, a live benchmark for workflow agents that separates a refreshable signal layer, updated across releases from public workflow-demand signals, from a reproducible, time-stamped release snapshot. Each release is constructed from public workflow-demand signals, with ClawHub Top-500 skills used in the current release, and materialized as controlled tasks with fixed fixtures, services, workspaces, and graders. For grading, Claw-Eval-Live records execution traces, audit logs, service state, and post-run workspace artifacts, using deterministic checks when evidence is sufficient and structured LLM judging only for semantic dimensions. The release contains 105 tasks spanning controlled business services and local workspace repair, and evaluates 13 frontier models under a shared public pass rule. Experiments reveal that reliable workflow automation remains far from solved: the leading model passes only 66.7% of tasks and no model reaches 70%. Failures are structured by task family and execution surface, with HR, management, and multi-system business workflows as persistent bottlenecks and local workspace repair comparatively easier but unsaturated. Leaderboard rank alone is insufficient because models with similar pass rates can diverge in overall completion, and task-level discrimination concentrates in a middle band of tasks. Claw-Eval-Live suggests that workflow-agent evaluation should be grounded twice, in fresh external demand and in verifiable agent action.
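To make the release-snapshot idea above concrete, here is a rough sketch with illustrative field names (not the paper's actual schema) of how a time-stamped release could bundle tasks with their fixtures, services, workspaces, and grader references while the signal layer keeps refreshing between releases.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    """One controlled task materialized from a workflow-demand signal (field names are illustrative)."""
    task_id: str
    skill: str                                    # e.g. the ClawHub skill the task was derived from
    fixtures: dict = field(default_factory=dict)  # seeded service data and starting files
    services: list = field(default_factory=list)  # business services the task touches
    workspace: str = ""                           # local workspace the agent operates in
    grader: str = ""                              # reference to the deterministic/LLM grading spec


@dataclass
class ReleaseSnapshot:
    """A reproducible, time-stamped release; the signal layer refreshes, snapshots do not change."""
    release_date: str
    signal_source: str                            # e.g. "ClawHub Top-500 skills"
    tasks: list = field(default_factory=list)


snapshot = ReleaseSnapshot(
    release_date="2026-05-01",
    signal_source="ClawHub Top-500 skills",
    tasks=[TaskSpec(task_id="hr-001", skill="employee-onboarding",
                    services=["hr-system", "email"], workspace="/workspaces/hr-001",
                    grader="deterministic+semantic")],
)
print(len(snapshot.tasks), snapshot.signal_source)
```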