NanoFlow: Towards Optimal Large Language Model Serving Throughput

Kan Zhu, Yilong Zhao, Liangyu Zhao, Gefei Zuo, Yile Gu, Dedong Xie, Yufei Gao, Qinyu Xu, Tian Tang, Zihao Ye, Keisuke Kamahori, Chien-Yu Lin, Stephanie Wang, Arvind Krishnamurthy, Baris Kasikci

2024-08-27

Summary

This paper discusses NanoFlow, a new framework designed to improve the efficiency of serving large language models (LLMs) by optimizing how resources are used within a single device.

What's the problem?

As the use of large language models grows, there is a need for serving systems that can handle many users at once without high latency. Current methods parallelize work across multiple devices (for example, data, tensor, or pipeline parallelism) but leave the compute, memory, and network resources within each device underutilized, leading to lower throughput and higher costs.

What's the solution?

NanoFlow addresses this issue with a technique called intra-device parallelism, which overlaps the use of different resources within a single device (compute, memory bandwidth, and network) at the same time. It does this by splitting requests into smaller parts (nano-batches) and co-scheduling operations so that tasks bound by different resources can run simultaneously on separate execution units of the device. This overlapping improves the overall speed and efficiency of processing requests.
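The scheduling idea can be illustrated with a toy sketch. This is not the authors' implementation; all names (`Op`, `split_into_nano_batches`, `schedule`) are hypothetical, and the scheduler is a simplified greedy model: at each time step, every nano-batch may advance to its next operation, but two operations only run in the same step if they use different resources.

```python
# Toy sketch of NanoFlow-style intra-device parallelism (hypothetical
# names, not the paper's actual implementation). A batch of requests is
# split into nano-batches, and operations with different resource
# profiles (compute-, memory-, or network-bound) are overlapped: while
# one nano-batch runs a compute-bound op, another nano-batch can run a
# memory-bound op in the same time step.
from dataclasses import dataclass

@dataclass
class Op:
    name: str
    resource: str  # "compute", "memory", or "network"

def split_into_nano_batches(requests, nano_batch_size):
    """Split a list of requests into smaller nano-batches."""
    return [requests[i:i + nano_batch_size]
            for i in range(0, len(requests), nano_batch_size)]

def schedule(nano_batches, ops):
    """Greedy operation-level pipeline.

    Each step is a list of (nano_batch_index, op_name) pairs that run
    concurrently. Ops contending for the same resource are serialized;
    ops on different resources overlap.
    """
    timeline = []
    progress = [0] * len(nano_batches)  # next-op index per nano-batch
    while any(p < len(ops) for p in progress):
        busy_resources = set()
        step = []
        for b, p in enumerate(progress):
            if p >= len(ops):
                continue  # this nano-batch is finished
            op = ops[p]
            if op.resource not in busy_resources:
                busy_resources.add(op.resource)
                step.append((b, op.name))
                progress[b] += 1
        timeline.append(step)
    return timeline

# With two nano-batches and a compute -> memory -> network op sequence,
# the pipeline finishes in 4 steps instead of the 6 a purely sequential
# schedule would need.
ops = [Op("GEMM", "compute"), Op("Attention", "memory"),
       Op("AllReduce", "network")]
nano_batches = split_into_nano_batches(list(range(8)), 4)
timeline = schedule(nano_batches, ops)
```

The real system also partitions the GPU's execution units among the overlapped operations and searches over pipeline parameters automatically; this sketch only captures the dependency-breaking and overlap idea.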

Why it matters?

This research is significant because it helps make large language models more accessible and efficient, allowing them to serve more users without needing as much computing power. By improving how these models operate, NanoFlow can enhance various applications that rely on AI, such as chatbots, virtual assistants, and more.

Abstract

The increasing usage of Large Language Models (LLMs) has resulted in a surging demand for planet-scale serving systems, where tens of thousands of GPUs continuously serve hundreds of millions of users. Consequently, throughput (under reasonable latency constraints) has emerged as a key metric that determines serving systems' performance. To boost throughput, various methods of inter-device parallelism (e.g., data, tensor, pipeline) have been explored. However, existing methods do not consider overlapping the utilization of different resources within a single device, leading to underutilization and sub-optimal performance. We propose NanoFlow, a novel serving framework that exploits intra-device parallelism, which overlaps the usage of resources including compute, memory, and network within a single device through operation co-scheduling. To exploit intra-device parallelism, NanoFlow introduces two key innovations: First, NanoFlow splits requests into nano-batches at the granularity of operations, which breaks the dependency of sequential operations in LLM inference and enables overlapping; then, to benefit from overlapping, NanoFlow uses an operation-level pipeline with execution unit scheduling, which partitions the device's functional units and simultaneously executes different operations in each unit. NanoFlow automates the pipeline setup using a parameter search algorithm, which enables easily porting NanoFlow to different models. We implement NanoFlow on NVIDIA GPUs and evaluate end-to-end serving throughput on several popular models such as LLaMA-2-70B, Mixtral 8x7B, and LLaMA-3-8B. With practical workloads, NanoFlow provides a 1.91x throughput boost compared to state-of-the-art serving systems, achieving 59% to 72% of optimal throughput across ported models.