Generalizing Test-time Compute-optimal Scaling as an Optimizable Graph

Fali Wang, Jihai Chen, Shuhua Yang, Runxue Bao, Tianxiang Zhao, Zhiwei Zhang, Xianfeng Tang, Hui Liu, Qi He, Suhang Wang

2025-11-04

Summary

This paper explores how to best use multiple large language models (LLMs) working together to get better results when you're actually *using* the models, a process called Test-Time Scaling. It focuses on finding the most efficient way to combine these models and how they should communicate with each other, all while staying within a certain computational budget.

What's the problem?

Currently, when people apply Test-Time Scaling, they usually stick to pre-defined ways of connecting the models. This isn't ideal because the best combination of models, and the best way for them to share information, changes depending on the task at hand. The challenge is that there are *so* many possible combinations of models and connection patterns that finding the absolute best one is extremely difficult, and what works best for one task doesn't guarantee good results on another.
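To get a feel for how quickly this search space grows, here is a rough back-of-the-envelope count. The formula is an illustrative assumption (R roles, each assigned one of M models, plus any subset of directed edges between distinct roles), not the paper's exact accounting:

```python
# Illustrative search-space count (assumption: R roles, each assigned one
# of M candidate models, plus any subset of directed edges between
# distinct roles, no self-loops).
def search_space_size(num_models: int, num_roles: int) -> int:
    assignments = num_models ** num_roles               # model choices per role
    topologies = 2 ** (num_roles * (num_roles - 1))     # possible directed edges
    return assignments * topologies

# Even 5 models and 4 roles yield millions of candidate designs.
print(search_space_size(5, 4))  # 5**4 * 2**12 = 2,560,000
```

With realistic numbers of models and roles, exhaustively evaluating every design is clearly out of reach, which is why the paper turns to a guided search instead.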

What's the solution?

The researchers treated the problem as designing a graph, where each LLM is a node and the connections between them are the edges. Since an exhaustive search of all possibilities is infeasible, they first ran pilot experiments to observe patterns in how these collaboration graphs behave, and then built a system called Agent-REINFORCE. It uses another LLM as an 'agent' to explore different graph designs: it samples a candidate design, gets feedback on how well it performs, and adjusts the design to improve it. This mirrors the classic REINFORCE algorithm from reinforcement learning, except that instead of a numeric gradient, the feedback comes in the form of text.
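The sampling-feedback-update loop can be sketched as follows. This is a hypothetical toy, not the paper's implementation: in Agent-REINFORCE the feedback is *textual* and produced by an LLM agent, whereas here a numeric reward stands in so the loop is runnable, and all names (`edge_probs`, `evaluate_graph`, `agent_reinforce`) are illustrative:

```python
import random

def sample_graph(edge_probs, rng):
    """Sample a collaboration graph by including each candidate edge
    independently with its current probability."""
    return {edge for edge, p in edge_probs.items() if rng.random() < p}

def evaluate_graph(graph):
    """Stand-in for running the multi-LLM pipeline and scoring its output;
    this toy evaluator rewards graphs that wire the drafter to the verifier."""
    return 1.0 if ("draft", "verify") in graph else 0.2

def agent_reinforce(edge_probs, steps=200, lr=0.1, seed=0):
    rng = random.Random(seed)
    baseline = 0.0  # running-mean baseline, as in vanilla REINFORCE
    for _ in range(steps):
        graph = sample_graph(edge_probs, rng)   # 1. sampling
        reward = evaluate_graph(graph)          # 2. feedback (textual in the paper)
        advantage = reward - baseline
        baseline = 0.9 * baseline + 0.1 * reward
        for edge in edge_probs:                 # 3. update the probabilistic graph
            direction = 1.0 if edge in graph else -1.0
            p = edge_probs[edge] + lr * advantage * direction
            edge_probs[edge] = min(max(p, 0.05), 0.95)  # keep some exploration
    return edge_probs

probs = agent_reinforce({("draft", "verify"): 0.5, ("verify", "draft"): 0.5})
```

After a couple hundred iterations the useful edge's probability climbs toward its cap, while the uninformative edge drifts. The paper's key twist is replacing the numeric `advantage` with rich textual critiques that the agent uses to update the probabilistic graph more sample-efficiently.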

Why does it matter?

This research is important because it makes multi-LLM systems both more accurate and more efficient. By automatically finding the best way to combine models at inference time, we can get better performance out of a fixed computational budget rather than relying on more powerful hardware. That lowers the barrier to using these tools and opens the door to tackling more complex problems that require the combined strengths of multiple LLMs.

Abstract

Test-Time Scaling (TTS) improves large language models (LLMs) by allocating additional computation during inference, typically through parallel, sequential, or hybrid scaling. However, prior studies often assume fixed collaboration architectures (e.g., topologies) and single-model usage, overlooking that optimal architectures and model combinations can vary across tasks. Therefore, we study the novel problem of searching for compute-optimal model combinations and architectures in TTS under a fixed budget. We formalize it as a multi-LLM collaboration graph, where nodes encode roles and LLM model assignments, and edges capture information flow. This problem is challenging because (i) the combinatorial search space is prohibitively large, and (ii) task-specific requirements demand tailored designs. To address these, we reformulate the problem as probabilistic graph optimization and, through pilot experiments, derive three empirical insights into TTS collaboration graphs. Guided by these insights, we propose Agent-REINFORCE, an LLM-agent-augmented framework that mirrors the REINFORCE pipeline by mapping sampling-gradient-update to sampling-feedback-update, where feedback serves as a textual gradient to update the probabilistic graph and efficiently search for optimal multi-LLM collaboration graphs. Experiments show that Agent-REINFORCE outperforms both traditional and LLM-based baselines in sample efficiency and search performance, and effectively identifies optimal graphs under joint objectives of accuracy and inference latency.
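The abstract's formalization, with nodes encoding roles plus LLM assignments and edges capturing information flow, can be pictured with a minimal data-structure sketch. The class and field names below are illustrative assumptions, not the paper's code:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Node:
    role: str   # e.g. "solver", "verifier", "aggregator"
    model: str  # the LLM assigned to this role

@dataclass
class CollaborationGraph:
    nodes: list          # Node instances
    edges: list          # (src_index, dst_index) pairs: information flow

    def successors(self, i):
        """Indices of nodes that receive this node's output."""
        return [dst for src, dst in self.edges if src == i]

# A tiny two-node graph: a solver whose output flows to a verifier.
g = CollaborationGraph(
    nodes=[Node("solver", "model-a"), Node("verifier", "model-b")],
    edges=[(0, 1)],
)
```

Searching over which model fills each `Node` and which `(src, dst)` edges exist is exactly the combinatorial space the paper's probabilistic graph optimization navigates.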