DeepFlow: Serverless Large Language Model Serving at Scale

Junhao Hu, Jiang Xu, Zhixia Liu, Yulong He, Yuetao Chen, Hao Xu, Jiang Liu, Baoquan Zhang, Shining Wan, Gengyuan Dan, Zhiyu Dong, Zhihao Ren, Jie Meng, Chao He, Changhong Liu, Tao Xie, Dayun Lin, Qin Zhang, Yue Yu, Hao Feng, Xusheng Chen, Yizhou Shan

2025-01-29

Summary

This paper talks about DeepFlow, a new system that makes it easier and faster to use big AI language models in the cloud. It's like creating a smart, efficient delivery service for AI, making sure the AI can work quickly and cheaply when lots of people need to use it at once.

What's the problem?

Big AI language models are super smart, but they're hard to use efficiently when many people want to access them at the same time. It's like having a genius who can answer any question, but they get overwhelmed when too many people ask questions at once. The main issues are figuring out how to use computer resources wisely, serve answers quickly, and start up fast when needed.

What's the solution?

The researchers created DeepFlow, which does several clever things to solve these problems. First, it organizes AI work into a simple request-job-task hierarchy that's easy to manage (a rough sketch of this idea appears below). Then, it uses a purpose-built engine called FlowServe that's really good at running AI models efficiently. DeepFlow also has smart ways of scheduling tasks and preparing the system in advance, like keeping pre-warmed copies of the model ready, so it can spin up 64 serving instances in just seconds. They've been running this system in production for over a year on a large cluster of Ascend AI chips.
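
To make the request-job-task idea more concrete, here is a minimal Python sketch of such a hierarchy. The class names and fields (like npu_slots) are assumptions for illustration only; the paper does not expose DeepFlow's actual interfaces here.

```python
# A minimal sketch of a request-job-task style hierarchy, assuming one
# user request fans out into jobs, and each job into schedulable tasks.
# All names and fields here are illustrative, not DeepFlow's actual API.
from dataclasses import dataclass, field
from typing import List

@dataclass
class Task:
    """Smallest schedulable unit, e.g. one prefill or decode phase."""
    task_id: str
    kind: str            # e.g. "prefill" or "decode"
    npu_slots: int = 1   # accelerator slots this task occupies

@dataclass
class Job:
    """A unit of work, such as one inference call or fine-tuning run."""
    job_id: str
    tasks: List[Task] = field(default_factory=list)

@dataclass
class Request:
    """What a customer submits through the serverless API."""
    request_id: str
    model: str
    jobs: List[Job] = field(default_factory=list)

# Example: one chat request broken into a prefill task and a decode task.
req = Request(
    request_id="r-001",
    model="my-llm",
    jobs=[Job(job_id="j-001", tasks=[
        Task(task_id="t-prefill", kind="prefill"),
        Task(task_id="t-decode", kind="decode"),
    ])],
)
print(f"{req.request_id} -> {len(req.jobs)} job(s), "
      f"{sum(len(j.tasks) for j in req.jobs)} task(s)")
```

Structuring work this way means the scheduler only ever reasons about tasks, while billing and APIs can stay at the request level.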

Why it matters?

This matters because it could make powerful AI more accessible and affordable for businesses and researchers. It's like turning that overwhelmed genius into a super-efficient team that can answer millions of questions quickly and cheaply. This could lead to more companies using advanced AI in their products and services, potentially improving things like customer service, content creation, and data analysis for everyone. It also shows how we can make AI systems work better in the real world, which is crucial as AI becomes more important in our daily lives.

Abstract

This paper introduces DeepFlow, a scalable and serverless AI platform designed to efficiently serve large language models (LLMs) at scale in cloud environments. DeepFlow addresses key challenges such as resource allocation, serving efficiency, and cold start latencies through four main design components. First, it uses a simple serverless abstraction called the request-job-task model, which helps manage AI workloads across post-training and model serving tasks. Second, it builds an in-house serving engine, FlowServe, using a microkernel-inspired design, NPU-centric execution, and SPMD-based parallelism to optimize LLM serving. The system also includes novel scheduling policies tailored for both PD-disaggregated and PD-colocated configurations. With optimizations like pre-warmed pods, DRAM pre-loading, and NPU-fork, DeepFlow can scale up to 64 instances in seconds. DeepFlow has been in production for over a year, operating on a large Ascend NPU cluster and providing industry-standard APIs for fine-tuning, agent serving, and model serving to our customers.
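
The scale-up trick from the abstract can be pictured with a toy model: a pool of pre-warmed pods, whose weights are already staged in DRAM, can come online much faster than cold starts that must fetch weights first. This is a minimal sketch under stated assumptions; the PodPool class and its latency constants are hypothetical, not DeepFlow's implementation or measured numbers.

```python
# A toy model of the fast scale-up idea: pods that are pre-warmed (process
# started, model weights already staged in DRAM) come online far faster than
# cold pods that must first fetch weights. The pool, latencies, and method
# names below are assumptions for illustration, not numbers from the paper.
class PodPool:
    WARM_START_S = 2.0    # made-up latency: DRAM -> NPU weight copy
    COLD_START_S = 60.0   # made-up latency: remote fetch + load + init

    def __init__(self, prewarmed: int):
        self.prewarmed = prewarmed  # pods kept warm ahead of demand

    def acquire(self, n: int) -> float:
        """Estimate seconds to bring n instances online, started in parallel."""
        warm = min(n, self.prewarmed)
        cold = n - warm
        self.prewarmed -= warm
        # Parallel startup: total latency is the slowest pod's latency.
        return max(self.WARM_START_S if warm else 0.0,
                   self.COLD_START_S if cold else 0.0)

pool = PodPool(prewarmed=64)
print(f"64 instances online in ~{pool.acquire(64):.0f}s with pre-warmed pods")
```

The point of the sketch is the asymmetry: once weights sit in DRAM, adding an instance is a cheap local copy, so burst demand can be absorbed in seconds rather than minutes.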