Efficient Agent Evaluation via Diversity-Guided User Simulation
Itay Nakash, George Kour, Ateret Anaby-Tavor
2026-04-28
Summary
This paper introduces a new way to test how well large language models (LLMs) perform when used as chatbots or assistants, focusing on finding weaknesses in their responses during conversations.
What's the problem?
Today, LLM-based agents are tested by simulating entire conversations over and over to see whether they complete the task. This is slow and wasteful because many conversations start the same way, so the same early turns are regenerated repeatedly. More importantly, this approach often misses subtle failures that only appear with unusual user inputs or in specific situations, because it doesn't explore enough distinct conversation paths.
What's the solution?
The researchers developed a system called DIVERT that tests more intelligently. Instead of starting each conversation from scratch, DIVERT saves a snapshot of the conversation state at important decision points. When it wants to explore a different user reply, it resumes from that snapshot instead of regenerating the opening turns. It also deliberately tries diverse and unexpected user responses to push the LLM into less common, but potentially problematic, scenarios. This branching approach lets it cover more ground and find more failures with less computing power.
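To make the branching idea concrete, here is a minimal Python sketch of a snapshot-and-branch loop. It is an illustration under simplifying assumptions, not the authors' implementation: the conversation state is reduced to a list of (speaker, message) turns, and run_agent_turn, propose_user_replies, and looks_like_failure are hypothetical placeholders for the agent under test, the user simulator, and a failure detector.

```python
# Illustrative sketch of snapshot-and-branch exploration (assumed, not the paper's code).
# The "state" here is just a list of (speaker, message) turns; the real framework
# snapshots the full agent-environment state (tools, memory, etc.).
import copy


def run_agent_turn(conversation):
    """Placeholder for calling the agent under test with the conversation so far."""
    return "<agent reply>"


def propose_user_replies(conversation, n_branches=3):
    """Placeholder user simulator that proposes several diverse follow-up messages."""
    return [f"<diverse user reply {i}>" for i in range(n_branches)]


def looks_like_failure(agent_reply):
    """Placeholder failure check (e.g., a judge model or rule-based verifier)."""
    return "sorry" in agent_reply.lower()


def explore(conversation, depth, max_depth, failures):
    """Depth-first exploration: generate the agent turn once, snapshot it, then
    branch into several diverse user replies that all reuse the shared prefix."""
    if depth == max_depth:
        return
    agent_reply = run_agent_turn(conversation)
    conversation = conversation + [("agent", agent_reply)]
    if looks_like_failure(agent_reply):
        failures.append(list(conversation))
    for user_reply in propose_user_replies(conversation):
        branch = copy.deepcopy(conversation)  # the "snapshot": no prefix is regenerated
        branch.append(("user", user_reply))
        explore(branch, depth + 1, max_depth, failures)


failures = []
explore([("user", "Hi, I need help with my order.")], 0, 3, failures)
print(f"Found {len(failures)} candidate failures.")
```

Because every branch resumes from the saved prefix, the agent's earlier turns are generated once per junction rather than once per full rollout, which is where the compute savings come from.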
Why does it matter?
This research is important because it provides a more efficient and thorough way to evaluate LLM agents before they're deployed in real-world applications. By surfacing more failures, developers can improve these models and make them more reliable for tasks like customer service, information lookup, or complex problem-solving. Ultimately, better testing leads to better and more trustworthy AI assistants.
Abstract
Large language models (LLMs) are increasingly deployed as customer-facing agents, yet evaluating their reliability remains challenging due to stochastic, multi-turn interactions. Current evaluation protocols rely on linear Monte Carlo rollouts of complete agent-user conversations to estimate success. However, this approach is computationally inefficient, repeatedly regenerating identical early prefixes, and often fails to uncover deep failure modes that arise from rare user behaviors. We introduce DIVERT (Diversity-Induced Evaluation via Branching of Trajectories), an efficient, snapshot-based, coverage-guided user simulation framework for systematic exploration of agent-user interactions. DIVERT captures the full agent-environment state at critical decision points and resumes execution from these snapshots, enabling reuse of shared conversation prefixes and reducing redundant computation. From each junction, the framework branches using targeted, diversity-inducing user responses, allowing directed exploration of alternative interaction paths. By focusing evaluation on semantically diverse and underexplored trajectories, DIVERT improves both efficiency and coverage. Empirical results show that it discovers more failures per token compared to standard linear rollout protocols, while expanding the set of tasks on which failures are identified.
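The abstract's "diversity-inducing user responses" can be read as a selection step: at each junction, keep only candidate user replies that are semantically far apart, so the branches cover distinct conversation paths. The sketch below shows one plausible heuristic for that step (greedy farthest-point selection over sentence embeddings). It is an assumption for illustration, not the paper's method, and embed is a hypothetical placeholder for a real sentence encoder.

```python
# Assumed, illustrative sketch of diversity-guided branching: keep candidate user
# replies that are mutually as dissimilar as possible. Not the paper's implementation.
import numpy as np


def embed(texts):
    """Placeholder embedding function (swap in a real sentence encoder)."""
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(texts), 8))


def select_diverse(candidates, k):
    """Greedy farthest-point selection: repeatedly add the candidate least similar
    to everything already chosen."""
    vecs = embed(candidates)
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)  # cosine similarity via dot product
    chosen = [0]
    while len(chosen) < min(k, len(candidates)):
        sims_to_chosen = vecs @ vecs[chosen].T          # shape: (n_candidates, n_chosen)
        farthest = int(np.argmin(sims_to_chosen.max(axis=1)))
        chosen.append(farthest)
    return [candidates[i] for i in chosen]


candidate_replies = [
    "Sure, that works for me.",
    "Actually, cancel everything and refund my card.",
    "Can you repeat that in Spanish?",
    "I already gave you my account number twice!",
]
print(select_diverse(candidate_replies, k=2))
```

Selecting replies this way steers each branch toward an underexplored region of the interaction space instead of sampling several near-identical follow-ups.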