
CocoaBench: Evaluating Unified Digital Agents in the Wild

CocoaBench Team, Shibo Hao, Zhining Zhang, Zhiqi Liang, Tianyang Liu, Yuheng Zha, Qiyue Gao, Jixuan Chen, Zilong Wang, Zhoujun Cheng, Haoxiang Zhang, Junli Wang, Hexi Jin, Boyuan Zheng, Kun Zhou, Yu Wang, Feng Yao, Licheng Liu, Yijiang Li, Zhifei Li, Zhengtao Han, Pracha Promthaw

2026-04-14


Summary

This paper introduces a new way to test how well AI agents handle complex tasks that require combining multiple skills, such as visual understanding, searching the internet, and writing code.

What's the problem?

AI agents are currently tested on individual skills in isolation. This doesn't reflect real-world situations, where completing a task often means combining several skills. Without a good way to measure how agents perform when everything must work together, it's hard to know where they still struggle.

What's the solution?

The researchers created CocoaBench, a set of challenging, long-horizon tasks designed to test agents' ability to flexibly combine vision, search, and coding. Each task is specified only by a simple instruction, and the final output is automatically checked for correctness. They also built CocoaAgent, a lightweight shared framework that makes it easier to compare different AI models on the same tasks.
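To make the task format concrete, here is a minimal sketch of what "an instruction plus an automatic evaluation function over the final output" could look like. This is an illustrative assumption, not CocoaBench's actual API; the `Task` class, the example task, and the `score` helper are all hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    """Hypothetical task record: an instruction paired with an automatic checker."""
    instruction: str                    # what the agent is asked to do
    evaluate: Callable[[str], bool]     # automatic check on the agent's final output

# Illustrative example: the checker only inspects the final answer, so it does
# not care whether the agent used vision, search, or code to produce it.
task = Task(
    instruction="Count the red markers in the chart image and report the number.",
    evaluate=lambda output: output.strip() == "7",
)

def score(agent_output: str, task: Task) -> bool:
    """Score any agent scaffold the same way: run its output through the checker."""
    return task.evaluate(agent_output)
```

Because each task is self-contained and checked automatically, different agent infrastructures can be evaluated on identical terms, which is what makes the benchmark's comparisons reliable and scalable.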

Why it matters?

This work is important because it shows that current AI agents aren't very reliable on tasks requiring multiple skills, with even the best system succeeding only about 45% of the time. It highlights areas where AI needs to improve, such as planning, using tools correctly, and understanding what it 'sees', ultimately pushing the field toward more capable and versatile AI assistants.

Abstract

LLM agents now perform strongly in software engineering, deep research, GUI automation, and various other applications, while recent agent scaffolds and models are increasingly integrating these capabilities into unified systems. Yet, most evaluations still test these capabilities in isolation, which leaves a gap for more diverse use cases that require agents to combine different capabilities. We introduce CocoaBench, a benchmark for unified digital agents built from human-designed, long-horizon tasks that require flexible composition of vision, search, and coding. Tasks are specified only by an instruction and an automatic evaluation function over the final output, enabling reliable and scalable evaluation across diverse agent infrastructures. We also present CocoaAgent, a lightweight shared scaffold for controlled comparison across model backbones. Experiments show that current agents remain far from reliable on CocoaBench, with the best evaluated system achieving only 45.1% success rate. Our analysis further points to substantial room for improvement in reasoning and planning, tool use and execution, and visual grounding.