Taming the Chaos: Coordinated Autoscaling for Heterogeneous and Disaggregated LLM Inference
Rongzhi Li, Ruogu Du, Zefang Chu, Sida Zhao, Chunlei Han, Zuocheng Shi, Yiwen Shao, Huanle Han, Long Huang, Zherui Liu, Shufan Liu
2025-08-28
Summary
This paper focuses on running powerful AI language models more efficiently, specifically on managing the GPU resources needed to serve those models to many users at once.
What's the problem?
Running these large language models requires a lot of processing power from specialized computer chips called GPUs. Modern systems split the work into two stages: 'prefill', which processes the user's input prompt, and 'decode', which generates the response one token at a time. Traditional methods for automatically adjusting resources to meet demand don't work well with this split-up approach, leading to wasted computing power, network bottlenecks, and imbalances between the prefill and decode stages – sometimes one stage is overloaded while the other sits idle.
What's the solution?
The researchers developed a system called HeteroScale. It manages resources with awareness of the different types of GPUs available and how they are connected to each other. It uses a single, robust metric to decide how much processing power to dedicate to both the prefill and decode stages, keeping them balanced and ensuring resources are used effectively. Essentially, it's a smarter autoscaler designed for this disaggregated style of AI model serving.
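The core idea of scaling both pools from one shared signal can be sketched as below. This is an illustrative toy, not HeteroScale's actual implementation: the metric choice (aggregate token throughput), the `PoolConfig` fields, and the `headroom` parameter are all assumptions for the sake of the example.

```python
# Illustrative sketch (NOT HeteroScale's real code): size the prefill
# and decode pools from ONE shared load metric so the two stages scale
# in lockstep rather than drifting out of balance.
import math
from dataclasses import dataclass

@dataclass
class PoolConfig:
    tokens_per_replica: float  # sustainable throughput of one replica (assumed)
    min_replicas: int = 1

def plan_replicas(load_tokens_per_sec: float,
                  prefill: PoolConfig,
                  decode: PoolConfig,
                  headroom: float = 0.2) -> tuple[int, int]:
    """Return (prefill_replicas, decode_replicas) for a given load.

    Because both targets are derived from the same signal, the
    prefill/decode ratio stays consistent as traffic rises and falls.
    """
    demand = load_tokens_per_sec * (1 + headroom)  # add safety margin
    n_prefill = max(prefill.min_replicas,
                    math.ceil(demand / prefill.tokens_per_replica))
    n_decode = max(decode.min_replicas,
                   math.ceil(demand / decode.tokens_per_replica))
    return n_prefill, n_decode

# Example: at 10,000 tokens/s with 20% headroom, demand is 12,000.
# Prefill replicas handling 4,000 each -> 3; decode at 2,500 each -> 5.
print(plan_replicas(10_000.0, PoolConfig(4_000.0), PoolConfig(2_500.0)))
```

The point of the sketch is the coupling: a traditional autoscaler would size each pool from its own local metric, which is exactly what the paper identifies as the source of prefill/decode imbalance.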
Why it matters?
HeteroScale significantly improves how efficiently these AI models are run. By increasing average GPU utilization by 26.6 percentage points, it saves hundreds of thousands of GPU-hours every day and reduces costs, all while still delivering a fast and reliable experience for users. This matters because it makes these powerful AI tools cheaper and more sustainable to operate at scale.
Abstract
Serving Large Language Models (LLMs) is a GPU-intensive task where traditional autoscalers fall short, particularly for modern Prefill-Decode (P/D) disaggregated architectures. This architectural shift, while powerful, introduces significant operational challenges, including inefficient use of heterogeneous hardware, network bottlenecks, and critical imbalances between prefill and decode stages. We introduce HeteroScale, a coordinated autoscaling framework that addresses the core challenges of P/D disaggregated serving. HeteroScale combines a topology-aware scheduler that adapts to heterogeneous hardware and network constraints with a novel metric-driven policy derived from the first large-scale empirical study of autoscaling signals in production. By leveraging a single, robust metric to jointly scale prefill and decode pools, HeteroScale maintains architectural balance while ensuring efficient, adaptive resource management. Deployed in a massive production environment on tens of thousands of GPUs, HeteroScale has proven its effectiveness, increasing average GPU utilization by a significant 26.6 percentage points and saving hundreds of thousands of GPU-hours daily, all while upholding stringent service level objectives.