ToolOrchestra: Elevating Intelligence via Efficient Model and Tool Orchestration
Hongjin Su, Shizhe Diao, Ximing Lu, Mingjie Liu, Jiacheng Xu, Xin Dong, Yonggan Fu, Peter Belcak, Hanrong Ye, Hongxu Yin, Yi Dong, Evelina Bakhturina, Tao Yu, Yejin Choi, Jan Kautz, Pavlo Molchanov
2025-12-03
Summary
This paper explores a new way to build really smart AI systems by combining smaller, specialized AI 'tools' with a central 'brain' that manages them, instead of relying on one gigantic AI model.
What's the problem?
Current large language models, while powerful, struggle with complex tasks that require deep reasoning and problem-solving, such as Humanity's Last Exam. These models are also very expensive to run because of their massive size and computational needs. Essentially, it's hard to make AI truly intelligent *and* efficient at the same time.
What's the solution?
The researchers developed a method called ToolOrchestra to train a small 'orchestrator' AI. This orchestrator doesn't try to *do* everything itself; instead, it learns to choose and coordinate different AI tools to tackle a problem. They used reinforcement learning, rewarding the orchestrator for good outcomes, for efficiency, and for choosing tools that users prefer. The resulting model, named Orchestrator, is relatively small (8 billion parameters) yet outperforms much larger models like GPT-5 on several challenging benchmarks.
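To make the reward design concrete, here is a minimal sketch of how outcome, efficiency, and user-preference signals could be combined into a single scalar reward for reinforcement learning. The weights and the exact form of each term are assumptions for illustration, not the paper's actual formulation.

```python
# Hypothetical sketch of an outcome-, efficiency-, and preference-aware
# reward in the spirit of ToolOrchestra. Weights and terms are assumed,
# not taken from the paper.

def composite_reward(outcome_correct, cost, max_cost, preferred_tool_used,
                     w_outcome=1.0, w_efficiency=0.3, w_preference=0.2):
    """Combine three signals into one scalar reward for the orchestrator."""
    # Binary outcome signal: did the orchestrated trajectory solve the task?
    outcome = 1.0 if outcome_correct else 0.0
    # Efficiency bonus: cheaper trajectories score higher, normalized to [0, 1].
    efficiency = max(0.0, 1.0 - cost / max_cost)
    # Preference signal: did the orchestrator pick tools the user prefers?
    preference = 1.0 if preferred_tool_used else 0.0
    return (w_outcome * outcome
            + w_efficiency * efficiency
            + w_preference * preference)
```

Under this sketch, a correct, cheap, preference-aligned trajectory earns the highest reward, so the policy is pushed toward solving the task with the fewest and most acceptable tool calls rather than maximizing accuracy alone.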
Why it matters?
This research shows that building AI systems with a 'divide and conquer' approach, where many specialized tools are managed by a smaller central AI, is a promising path forward. It's more effective, more efficient, and potentially more scalable than simply building ever-larger single models. This could make advanced AI more accessible and practical for real-world applications.
Abstract
Large language models are powerful generalists, yet solving deep and complex problems such as those of the Humanity's Last Exam (HLE) remains both conceptually challenging and computationally expensive. We show that small orchestrators managing other models and a variety of tools can both push the upper bound of intelligence and improve efficiency in solving difficult agentic tasks. We introduce ToolOrchestra, a method for training small orchestrators that coordinate intelligent tools. ToolOrchestra explicitly uses reinforcement learning with outcome-, efficiency-, and user-preference-aware rewards. Using ToolOrchestra, we produce Orchestrator, an 8B model that achieves higher accuracy at lower cost than previous tool-use agents while aligning with user preferences on which tools are to be used for a given query. On HLE, Orchestrator achieves a score of 37.1%, outperforming GPT-5 (35.1%) while being 2.5x more efficient. On tau2-Bench and FRAMES, Orchestrator surpasses GPT-5 by a wide margin while using only about 30% of the cost. Extensive analysis shows that Orchestrator achieves the best trade-off between performance and cost under multiple metrics, and generalizes robustly to unseen tools. These results demonstrate that composing diverse tools with a lightweight orchestration model is both more efficient and more effective than existing methods, paving the way for practical and scalable tool-augmented reasoning systems.