CostBench: Evaluating Multi-Turn Cost-Optimal Planning and Adaptation in Dynamic Environments for LLM Tool-Use Agents

Jiayu Liu, Cheng Qian, Zhaochen Su, Qing Zong, Shijue Huang, Bingxiang He, Yi R. Fung

2025-11-06

Summary

This paper introduces a new way to test how well AI agents, specifically Large Language Models, can plan and adjust to changes while also being mindful of costs.

What's the problem?

Currently, we mostly test AI agents on whether they can *finish* a task, but we don't really check if they can do it in the *most efficient* way, considering things like how much each step costs or what happens when something goes wrong. This is a problem because real-world situations are unpredictable and require agents to be both smart and economical.

What's the solution?

The researchers created a benchmark called CostBench, which is set in the travel-planning domain. It presents agents with tasks that can be solved in multiple ways, each with a different total cost. It also throws unexpected problems at the agents, like tools breaking or prices changing, forcing them to replan on the fly. They then tested several leading open-source and proprietary models, including GPT-5, to see how well they handled these cost-related challenges.
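The core challenge CostBench poses can be thought of as cheapest-path search over a graph of tool calls, plus replanning when an edge is blocked. The sketch below is purely illustrative (the graph, tool names, and costs are invented for this example, not taken from the benchmark): it finds the cost-optimal tool sequence with Dijkstra's algorithm, then replans after a simulated "tool failure" blocking event.

```python
import heapq

def cheapest_plan(graph, start, goal):
    """Dijkstra over a tool graph: nodes are cities, edges are (tool, next, cost)."""
    frontier = [(0, start, [])]  # (total cost so far, current city, tools used)
    best = {}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path  # first pop of goal is cost-optimal
        if node in best and best[node] <= cost:
            continue
        best[node] = cost
        for tool, nxt, c in graph.get(node, []):
            heapq.heappush(frontier, (cost + c, nxt, path + [tool]))
    return float("inf"), []  # goal unreachable

# Hypothetical travel graph: city -> [(tool_name, next_city, cost), ...]
graph = {
    "A": [("flight_AB", "B", 300), ("train_AC", "C", 120)],
    "B": [("flight_BD", "D", 200)],
    "C": [("bus_CD", "D", 80), ("flight_CD", "D", 250)],
}

# Static setting: cheapest route A -> D is train_AC + bus_CD (cost 200).
cost, plan = cheapest_plan(graph, "A", "D")

# Dynamic blocking event: bus_CD fails mid-trip, so replan from the
# current city "C" over the updated graph.
graph["C"] = [e for e in graph["C"] if e[0] != "bus_CD"]
new_cost, new_plan = cheapest_plan(graph, "C", "D")
```

An agent that commits to its initial plan would stall at city C when the bus tool fails; the benchmark rewards recomputing the optimum from the current state, which here switches to the pricier flight_CD.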

Why it matters?

The results showed that even the best AI models struggle with cost-effective planning, especially when things change unexpectedly. This research highlights a key weakness in current AI agents and provides a tool, CostBench, to help developers build future AI that can make smart, economical decisions in a dynamic world.

Abstract

Current evaluations of Large Language Model (LLM) agents primarily emphasize task completion, often overlooking resource efficiency and adaptability. This neglects a crucial capability: agents' ability to devise and adjust cost-optimal plans in response to changing environments. To bridge this gap, we introduce CostBench, a scalable, cost-centric benchmark designed to evaluate agents' economic reasoning and replanning abilities. Situated in the travel-planning domain, CostBench comprises tasks solvable via multiple sequences of atomic and composite tools with diverse, customizable costs. It also supports four types of dynamic blocking events, such as tool failures and cost changes, to simulate real-world unpredictability and require agents to adapt in real time. Evaluating leading open-source and proprietary models on CostBench reveals a substantial gap in cost-aware planning: agents frequently fail to identify cost-optimal solutions in static settings, with even GPT-5 achieving less than 75% exact match rate on the hardest tasks, and performance further dropping by around 40% under dynamic conditions. By diagnosing these weaknesses, CostBench lays the groundwork for developing future agents that are both economically rational and robust.