ClawArena: Benchmarking AI Agents in Evolving Information Environments
Haonian Ji, Kaiwen Xiong, Siwei Han, Peng Xia, Shi Qiu, Yiyang Zhou, Jiaqi Liu, Jinlong Li, Bingzhou Li, Zeyu Zheng, Cihang Xie, Huaxiu Yao
2026-04-07
Summary
This paper introduces a new way to test AI assistants that are designed to help people over a long period of time, constantly learning and adapting.
What's the problem?
Current tests for AI assistants are too simple: they assume information stays the same and comes from a single, reliable source. Real-world assistants need to deal with information that changes, comes from many places, and sometimes contradicts itself. They also need to learn what you want from your corrections, rather than from explicit instructions spelling out every detail. Existing benchmarks don't check whether AI can handle this ongoing complexity.
What's the solution?
The researchers created a testing environment called ClawArena. In this environment, the AI assistant only gets partial and sometimes conflicting information from different sources, like chat logs and workspace files. The test focuses on three main challenges: deciding what to believe when sources disagree, revising its beliefs as new information arrives, and learning your preferences from how you correct it. The benchmark covers 64 scenarios across 8 professional fields, totaling 1,879 evaluation rounds and 365 updates to the information. The researchers then tested five agent frameworks paired with five language models to see how well they performed.
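To make the setup concrete, here is a minimal sketch of how a ClawArena-style evaluation loop might look. This is a hypothetical illustration, not the paper's actual code or API: the class names (`Update`, `Question`), the exact-set scoring rule, and the toy `naive_agent` are all assumptions made for the example. It shows the core idea of staged updates that can invalidate earlier evidence, with set-selection questions scored round by round.

```python
from dataclasses import dataclass

@dataclass
class Update:
    """A staged change to the environment: new evidence that may invalidate old beliefs."""
    round_no: int
    source: str   # e.g. "chat" or "file" (hypothetical source labels)
    content: str

@dataclass
class Question:
    """Set-selection question: the agent must pick every currently-true option."""
    round_no: int
    prompt: str
    options: list
    answer: set   # correct option labels *at this round*, per hidden ground truth

def evaluate(agent, questions, updates):
    """Replay staged updates round by round and score set-selection answers.

    `agent` is any callable: (context_plus_prompt, options) -> set of chosen labels.
    Exact-set match is required here (an assumption; the paper's metric may differ).
    """
    correct = 0
    for q in questions:
        # The agent only sees updates staged at or before this round.
        visible = [u for u in updates if u.round_no <= q.round_no]
        context = "\n".join(f"[{u.source}] {u.content}" for u in visible)
        chosen = agent(context + "\n" + q.prompt, q.options)
        correct += chosen == q.answer
    return correct / len(questions)

# Toy scenario: a later update contradicts an earlier one.
updates = [
    Update(1, "chat", "Meeting is on Monday."),
    Update(2, "file", "Correction: meeting moved to Friday."),
]
questions = [
    Question(1, "When is the meeting?", ["Monday", "Friday"], {"Monday"}),
    Question(2, "When is the meeting?", ["Monday", "Friday"], {"Friday"}),
]

def naive_agent(context, options):
    # Believes the first mention it sees and never revises its belief.
    for line in context.splitlines():
        for opt in options:
            if opt in line:
                return {opt}
    return set()

print(evaluate(naive_agent, questions, updates))  # prints 0.5: fails after the update
```

The design point this illustrates is that belief-revision difficulty comes from how updates are staged: an agent that anchors on early evidence passes round 1 but fails round 2, which is exactly the failure mode the benchmark is built to expose.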
Why it matters?
This work is important because it provides a more realistic way to evaluate AI assistants. By testing how well they handle changing information and learn from corrections, it helps developers build assistants that stay reliable and helpful over the long run. The results show that both the underlying model's capability and the assistant's framework design matter substantially, and that well-designed frameworks can partially make up for a weaker underlying model.
Abstract
AI agents deployed as persistent assistants must maintain correct beliefs as their information environment evolves. In practice, evidence is scattered across heterogeneous sources that often contradict one another, new information can invalidate earlier conclusions, and user preferences surface through corrections rather than explicit instructions. Existing benchmarks largely assume static, single-authority settings and do not evaluate whether agents can keep up with this complexity. We introduce ClawArena, a benchmark for evaluating AI agents in evolving information environments. Each scenario maintains a complete hidden ground truth while exposing the agent only to noisy, partial, and sometimes contradictory traces across multi-channel sessions, workspace files, and staged updates. Evaluation is organized around three coupled challenges: multi-source conflict reasoning, dynamic belief revision, and implicit personalization, whose interactions yield a 14-category question taxonomy. Two question formats, multi-choice (set-selection) and shell-based executable checks, test both reasoning and workspace grounding. The current release contains 64 scenarios across 8 professional domains, totaling 1,879 evaluation rounds and 365 dynamic updates. Experiments on five agent frameworks and five language models show that both model capability (15.4% range) and framework design (9.2%) substantially affect performance, that self-evolving skill frameworks can partially close model-capability gaps, and that belief revision difficulty is determined by update design strategy rather than the mere presence of updates. Code is available at https://github.com/aiming-lab/ClawArena.