ClawMark: A Living-World Benchmark for Multi-Turn, Multi-Day, Multimodal Coworker Agents
Fanqing Meng, Lingxiao Du, Zijian Wu, Guanzheng Chen, Xiangyan Liu, Jiaqi Liao, Chonghe Jiang, Zhenglin Wan, Jiawei Gu, Pengfei Zhou, Rui Huang, Ziqi Zhao, Shengyuan Ding, Ailing Yu, Bo Peng, Bowei Xia, Hao Sun, Haotian Liang, Ji Xie, Jiajun Chen, Jiajun Song, Liu Yang
2026-04-28
Summary
This paper introduces a new way to test AI agents that are designed to act like helpful coworkers over extended periods, such as several working days. These agents need to handle tasks that change as new information arrives, such as new emails or updated files.
What's the problem?
Current tests for AI agents are too simple. They usually cover only a single, short interaction and focus mostly on text-based tasks. This doesn't reflect how agents are actually used in real-world jobs, where conditions change constantly and work spans different types of information such as emails, calendars, and spreadsheets. As a result, existing benchmarks can't accurately measure an agent's ability to adapt to a dynamic environment over multiple days.
What's the solution?
The researchers created a testing environment called 'AutoGenWorkflows' that simulates a real work setting. It includes services such as email, calendars, and file systems whose state changes over time. They designed 100 tasks across 13 professional scenarios and used automated, rule-based checks (rather than another AI acting as a judge) that verify whether an agent completed each task by inspecting the final state of these services. They then tested seven frontier AI systems on this benchmark.
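To make the verification idea concrete, here is a minimal sketch of what a deterministic checker over final service state could look like. The snapshot format, field names, and example task conditions below are illustrative assumptions for this summary, not the benchmark's actual checker API.

```python
# Hypothetical sketch of deterministic, rule-based verification:
# checkers inspect the post-execution state of the sandboxed services
# (no LLM judge involved). Field names here are illustrative assumptions.

def check_meeting_rescheduled(state: dict) -> bool:
    """Pass if the 'Budget review' event ended up on the new date."""
    events = state.get("calendar", {}).get("events", [])
    return any(
        e.get("title") == "Budget review" and e.get("date") == "2026-05-02"
        for e in events
    )

def check_reply_sent(state: dict) -> bool:
    """Pass if a reply email was sent to the client's address."""
    sent = state.get("email", {}).get("sent", [])
    return any(
        "client@example.com" in m.get("to", []) and m.get("in_reply_to")
        for m in sent
    )

def weighted_score(state: dict, checkers) -> float:
    """Fraction of checkers that pass; strict success requires all to pass."""
    results = [bool(c(state)) for c in checkers]
    return sum(results) / len(results)

if __name__ == "__main__":
    # Example post-execution snapshot of the sandboxed services.
    final_state = {
        "calendar": {"events": [{"title": "Budget review", "date": "2026-05-02"}]},
        "email": {"sent": [{"to": ["client@example.com"], "in_reply_to": "msg-42"}]},
    }
    checkers = [check_meeting_rescheduled, check_reply_sent]
    print("weighted score:", weighted_score(final_state, checkers))
    print("strict success:", all(c(final_state) for c in checkers))
```

Because each checker is an ordinary deterministic function over the final service state, scores are reproducible and do not depend on a model's judgment.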
Why does it matter?
This work is important because it provides a more realistic way to evaluate AI agents that are meant to be long-term assistants. The results show that while these agents can make some progress, they still struggle to complete complex tasks from start to finish, especially when the environment changes. This highlights the need for further research on how to make AI agents more adaptable and reliable in dynamic real-world situations.
Abstract
Language-model agents are increasingly used as persistent coworkers that assist users across multiple working days. During such workflows, the surrounding environment may change independently of the agent: new emails arrive, calendar entries shift, knowledge-base records are updated, and evidence appears across images, scanned PDFs, audio, video, and spreadsheets. Existing benchmarks do not adequately evaluate this setting because they typically run within a single static episode and remain largely text-centric. We introduce ClawMark, a benchmark for coworker agents built around multi-turn, multi-day tasks, a stateful sandboxed service environment whose state evolves between turns, and rule-based verification. The current release contains 100 tasks across 13 professional scenarios, executed against five stateful sandboxed services (filesystem, email, calendar, knowledge base, spreadsheet) and scored by 1537 deterministic Python checkers over post-execution service state; no LLM-as-judge is invoked during scoring. We benchmark seven frontier agent systems. The strongest model reaches a weighted score of 75.8, but the best strict Task Success rate is only 20.0%, indicating that partial progress is common while complete end-to-end workflow completion remains rare. Turn-level analysis shows that performance drops after the first exogenous environment update, highlighting adaptation to changing state as a key open challenge. We release the benchmark, evaluation harness, and construction pipeline to support reproducible coworker-agent evaluation.