HSCodeComp: A Realistic and Expert-level Benchmark for Deep Search Agents in Hierarchical Rule Application

Yiqian Yang, Tian Lan, Qianghuai Jia, Li Zhu, Hui Jiang, Hang Zhu, Longyue Wang, Weihua Luo, Kaifu Zhang

2025-10-23

Summary

This paper introduces a new challenge for AI agents that goes beyond simply finding information; it tests their ability to understand and *apply* complex rules, like those found in legal documents or trade regulations, to solve a real-world problem.

What's the problem?

Current AI benchmarks don't adequately test how well agents can handle situations requiring them to follow a series of specific, sometimes unclear, rules to reach a conclusion. Think about figuring out what a product is classified as for international shipping – it's not always straightforward! Existing AI systems struggle with this kind of 'deep reasoning' where they need to navigate a hierarchy of rules and make a precise determination.

What's the solution?

The researchers created a benchmark called HSCodeComp. This benchmark uses real product descriptions from e-commerce websites and asks AI agents to predict the correct 10-digit Harmonized System Code (HSCode) – a standardized code used for classifying goods internationally. Human experts annotated these products with the correct codes, providing a standard for comparison. They then tested several advanced AI models on this task.

Why does it matter?

This work is important because accurately classifying products for international trade is crucial for a smooth global supply chain. An AI system that cannot do this reliably is of limited use in areas like e-commerce, logistics, and customs. The results show that even the best AI agent reaches only 46.8% 10-digit accuracy, far below the 95.0% achieved by human experts, highlighting a significant area for improvement in AI development.

Abstract

Effective deep search agents must not only access open-domain and domain-specific knowledge but also apply complex rules, such as legal clauses, medical manuals, and tariff rules. These rules often feature vague boundaries and implicit logical relationships, making precise application challenging for agents. However, this critical capability is largely overlooked by current agent benchmarks. To fill this gap, we introduce HSCodeComp, the first realistic, expert-level e-commerce benchmark designed to evaluate deep search agents in hierarchical rule application. In this task, the deep reasoning process of agents is guided by these rules to predict the 10-digit Harmonized System Code (HSCode) of products with noisy but realistic descriptions. These codes, established by the World Customs Organization, are vital for global supply chain efficiency. Built from real-world data collected from large-scale e-commerce platforms, our proposed HSCodeComp comprises 632 product entries spanning diverse product categories, with their HSCodes annotated by several human experts. Extensive experimental results on several state-of-the-art LLMs and on open-source and closed-source agents reveal a huge performance gap: the best agent achieves only 46.8% 10-digit accuracy, far below human experts at 95.0%. In addition, detailed analysis demonstrates the challenges of hierarchical rule application and shows that test-time scaling fails to improve performance further.
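The headline metric above is 10-digit accuracy. As a rough illustration of how such a score could be computed, and how partial credit at coarser levels of the hierarchy might be inspected, here is a minimal Python sketch. It is not the authors' evaluation script: the function names, the digit-only normalization, and the sample codes are all assumptions made for illustration. The 2-, 4-, and 6-digit prefixes correspond to the chapter, heading, and subheading levels of the Harmonized System; longer prefixes are country-specific extensions.

```python
from typing import Sequence


def normalize(code: str) -> str:
    """Keep digits only, so '8517.62.00.90' and '8517620090' compare equal."""
    return "".join(ch for ch in code if ch.isdigit())


def prefix_accuracy(
    predictions: Sequence[str], references: Sequence[str], digits: int = 10
) -> float:
    """Fraction of items whose first `digits` digits match the reference.

    Hypothetical helper: a prediction shorter than `digits` counts as wrong.
    """
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")
    if not references:
        raise ValueError("cannot score an empty set")
    hits = 0
    for pred, ref in zip(predictions, references):
        pred, ref = normalize(pred), normalize(ref)
        if len(pred) >= digits and len(ref) >= digits and pred[:digits] == ref[:digits]:
            hits += 1
    return hits / len(references)


# Made-up placeholder codes, NOT taken from the benchmark.
references = ["8517620090", "6109100010", "9503000090", "4202210000"]
predictions = ["8517.62.00.90", "6109100090", "9503000090", "4203210000"]

for level in (2, 4, 6, 8, 10):
    print(f"{level}-digit accuracy: {prefix_accuracy(predictions, references, level):.2f}")
```

Reporting accuracy at several prefix lengths shows where in the hierarchy a prediction goes wrong: an agent may pick the right chapter yet miss the final digits, which an exact-match score alone would hide.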
