EcomBench: Towards Holistic Evaluation of Foundation Agents in E-commerce
Rui Min, Zile Qiao, Ze Xu, Jiawen Zhai, Wenyu Gao, Xuanzhong Chen, Haozhen Sun, Zhen Zhang, Xinyu Wang, Hong Zhou, Wenbiao Yin, Xuan Zhou, Yong Jiang, Haicheng Liu, Liang Ding, Ling Zou, Yi R. Fung, Yalong Li, Pengjun Xie
2025-12-10
Summary
This paper introduces a new way to test how well AI agents can handle real-world tasks, specifically within the world of online shopping.
What's the problem?
Currently, most tests for AI agents take place in simplified, artificial environments that don't reflect the complexities of real applications. This makes it hard to know whether an agent that performs well in a test will actually be useful when dealing with real customers and changing market conditions, like those found in an online store.
What's the solution?
The researchers created a benchmark called EcomBench, which is built from real user requests and data drawn from major e-commerce platforms. The benchmark covers several categories of shopping tasks at three levels of difficulty, designed to test an agent's ability to find information, reason through multiple steps, and combine knowledge from different sources. Human experts checked the data to make sure it was accurate and relevant.
Why it matters?
EcomBench is important because it provides a more realistic and challenging way to evaluate AI agents. By testing agents in a setting that closely mirrors real-world e-commerce, we can get a better understanding of their true capabilities and build more reliable and helpful AI shopping assistants.
Abstract
Foundation agents have rapidly advanced in their ability to reason and interact with real environments, making the evaluation of their core capabilities increasingly important. While many benchmarks have been developed to assess agent performance, most concentrate on academic settings or artificially designed scenarios while overlooking the challenges that arise in real applications. To address this issue, we focus on a highly practical real-world setting, the e-commerce domain, which involves a large volume of diverse user interactions, dynamic market conditions, and tasks directly tied to real decision-making processes. To this end, we introduce EcomBench, a holistic E-commerce Benchmark designed to evaluate agent performance in realistic e-commerce environments. EcomBench is built from genuine user demands embedded in leading global e-commerce ecosystems and is carefully curated and annotated by human experts to ensure clarity, accuracy, and domain relevance. It covers multiple task categories within e-commerce scenarios and defines three difficulty levels that evaluate agents on key capabilities such as deep information retrieval, multi-step reasoning, and cross-source knowledge integration. By grounding evaluation in real e-commerce contexts, EcomBench provides a rigorous and dynamic testbed for measuring the practical capabilities of agents in modern e-commerce.