InteractComp: Evaluating Search Agents With Ambiguous Queries

Mingyi Deng, Lijun Huang, Yani Fan, Jiayi Zhang, Fashen Ren, Jinyi Bai, Fuzhen Yang, Dayi Miao, Zhaoyang Yu, Yifan Wu, Yanfei Zhang, Fengwei Teng, Yingjia Wan, Song Hu, Yude Li, Xin Jin, Conghao Hu, Haoyu Li, Qirui Fu, Tai Zhong, Xinyu Wang, Xiangru Tang

2025-10-29

Summary

This paper focuses on a limitation of current AI search agents: they are very good at finding information but assume you already know exactly what you're looking for. It introduces a new way to test whether these agents can handle situations where your initial query is unclear and requires back-and-forth questioning, like a conversation with a librarian.

What's the problem?

Most search agents act like you give them a perfect question right away. In reality, people often start with vague ideas and need to refine their search through questions and answers. Current AI agents don't really *do* that interactive questioning, and there wasn't a good way to measure how well they *could* do it. Existing tests just give the agent all the information upfront, which isn't realistic.

What's the solution?

The researchers created a new benchmark called InteractComp. It includes 210 expert-curated questions across 9 domains that are intentionally ambiguous, meaning each has multiple plausible interpretations. The agent has to figure out which one the user intends by asking clarifying questions. They tested 17 different AI models and found that even the best one struggled, answering correctly only about 13.73% of the time, despite reaching 71.50% when given the full context upfront. However, when *forced* to ask questions, the models performed much better, showing they have the latent capability but don't use it on their own. The authors also found that interaction ability has stagnated over the past 15 months, even as search performance on related benchmarks improved roughly seven-fold.
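To make the evaluation setup concrete, here is a minimal sketch of an interactive evaluation loop in the spirit described above. Everything here is illustrative, not the benchmark's actual code: the agent, the simulated user, and all function names are hypothetical. The key idea is that the intended meaning is revealed only if the agent asks a clarifying question before committing to an answer.

```python
# Hypothetical sketch of evaluating a search agent on an ambiguous query.
# All names (simulated_user, evaluate, cautious_agent) are illustrative
# assumptions, not part of the InteractComp codebase.

def simulated_user(question, clarification=None):
    """Oracle that reveals the intended meaning only when asked."""
    intents = {
        "Which 'Mercury' paper?": "the planet, not the element",
    }
    if clarification:
        return intents.get(question, "unknown")
    return None  # no clarification requested, nothing revealed


def evaluate(agent, question, answer_key):
    """Score one question, allowing the agent one clarifying question.

    The agent returns an action dict: {"type": "clarify" | "answer",
    "text": ...}. A correct final answer counts as a pass.
    """
    action = agent(question, context=None)
    if action["type"] == "clarify":
        context = simulated_user(question, clarification=action["text"])
        action = agent(question, context=context)
    return action["type"] == "answer" and action["text"] == answer_key


# A toy agent that always asks before answering -- mirroring the paper's
# finding that forced interaction dramatically improves accuracy.
def cautious_agent(question, context=None):
    if context is None:
        return {"type": "clarify", "text": "Which meaning do you intend?"}
    return {"type": "answer", "text": f"answer given {context}"}


print(evaluate(cautious_agent,
               "Which 'Mercury' paper?",
               "answer given the planet, not the element"))  # True
```

An agent that answers immediately without clarifying would only succeed by guessing the intended interpretation, which matches the paper's observation that models fail not from weak reasoning but from overconfidence in their first reading of the query.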

Why it matters?

This work is important because it highlights a key weakness in current AI search technology. While agents are getting better at finding information, they're not getting better at understanding *what* people actually want. InteractComp provides a valuable tool for both evaluating and improving the interactive abilities of these agents, making them more helpful and user-friendly in real-world situations.

Abstract

Language agents have demonstrated remarkable potential in web search and information retrieval. However, these search agents assume user queries are complete and unambiguous, an assumption that diverges from reality where users begin with incomplete queries requiring clarification through interaction. Yet most agents lack interactive mechanisms during the search process, and existing benchmarks cannot assess this capability. To address this gap, we introduce InteractComp, a benchmark designed to evaluate whether search agents can recognize query ambiguity and actively interact to resolve it during search. Following the principle of easy to verify, interact to disambiguate, we construct 210 expert-curated questions across 9 domains through a target-distractor methodology that creates genuine ambiguity resolvable only through interaction. Evaluation of 17 models reveals striking failure: the best model achieves only 13.73% accuracy despite 71.50% with complete context, exposing systematic overconfidence rather than reasoning deficits. Forced interaction produces dramatic gains, demonstrating latent capability current strategies fail to engage. Longitudinal analysis shows interaction capabilities stagnated over 15 months while search performance improved seven-fold, revealing a critical blind spot. This stagnation, coupled with the immediate feedback inherent to search tasks, makes InteractComp a valuable resource for both evaluating and training interaction capabilities in search agents. The code is available at https://github.com/FoundationAgents/InteractComp.