MemServe: Context Caching for Disaggregated LLM Serving with Elastic Memory Pool

Cunchen Hu, Heyang Huang, Junhao Hu, Jiang Xu, Xusheng Chen, Tao Xie, Chenxi Wang, Sa Wang, Yungang Bao, Ninghui Sun, Yizhou Shan

2024-06-27

Summary

This paper introduces MemServe, a new system designed to improve how large language models (LLMs) handle requests by using a technique called context caching and by managing memory more efficiently across serving instances. The goal is to serve responses to users faster.

What's the problem?

Traditionally, LLM serving systems were stateless: they discarded intermediate state, such as the KV cache, after each request. As serving has become stateful, with techniques like context caching and disaggregated inference keeping the KV cache alive across requests and across machines, systems need better ways to manage memory and data between requests. Existing memory-management approaches were not efficient enough for this, leading to redundant computation, slower response times, and wasted resources when serving many requests.

What's the solution?

To solve this problem, the authors introduce MemServe, whose core component is MemPool: an elastic memory pool that manages distributed memory and key-value (KV) caches across serving instances. Using MemPool's APIs, MemServe combines context caching (reusing KV cache saved from previous requests) with disaggregated inference (running the prompt-processing, or prefill, phase and the token-generation, or decode, phase of a request on separate instances). A global scheduler then routes requests to maximize reuse of cached data, using a locality-aware policy built on a global prompt tree.
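
To make the idea concrete, here is a minimal Python sketch of what an elastic KV-cache pool could look like. All names here (KVPool, put, longest_prefix_hit, free) are illustrative assumptions for this explainer, not MemServe's actual MemPool API, which the paper describes at the level of distributed memory management.

```python
# Hypothetical sketch: a pool that indexes KV caches across serving
# instances. Class and method names are illustrative assumptions,
# not MemServe's real MemPool API.
from dataclasses import dataclass


@dataclass
class KVEntry:
    tokens: tuple[int, ...]  # token-id prefix this cache entry covers
    instance: str            # serving instance holding the KV tensors
    handle: int              # opaque handle to the underlying memory


class KVPool:
    """Global index of KV caches, so a decode instance can reuse KV
    produced by a prefill instance (intra-request, disaggregated
    inference) or by an earlier request (inter-request, context
    caching)."""

    def __init__(self) -> None:
        self._index: dict[tuple[int, ...], KVEntry] = {}
        self._next_handle = 0

    def put(self, tokens: tuple[int, ...], instance: str) -> int:
        """Register the KV cache for a token prefix held by `instance`."""
        handle = self._next_handle
        self._next_handle += 1
        self._index[tokens] = KVEntry(tokens, instance, handle)
        return handle

    def longest_prefix_hit(self, tokens: tuple[int, ...]) -> KVEntry | None:
        """Find the cached entry covering the longest prefix of `tokens`."""
        for n in range(len(tokens), 0, -1):
            entry = self._index.get(tokens[:n])
            if entry is not None:
                return entry
        return None

    def free(self, tokens: tuple[int, ...]) -> None:
        """Evict an entry, e.g., when its instance reclaims memory."""
        self._index.pop(tokens, None)


# Usage: a prefill instance registers the KV it computed; a later
# request sharing that prefix skips recomputing those tokens.
pool = KVPool()
pool.put((1, 2, 3, 4), instance="prefill-0")
hit = pool.longest_prefix_hit((1, 2, 3, 4, 5, 6))
assert hit is not None and hit.instance == "prefill-0"
```

In a real deployment, a hit like this means the matching KV tensors only need to be transferred or accessed in place rather than recomputed, which is where the latency savings come from.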

Why it matters?

This research is important because it improves the performance of the systems that serve large language models, making better use of memory and responding to user requests faster. By making serving faster and more efficient, MemServe can help applications that rely on LLMs, such as chatbots, virtual assistants, and other AI-driven services, provide better user experiences.

Abstract

Large language model (LLM) serving has transformed from stateless to stateful systems, utilizing techniques like context caching and disaggregated inference. These optimizations extend the lifespan and domain of the KV cache, necessitating a new architectural approach. We present MemServe, a unified system that integrates both inter-request and intra-request optimizations. MemServe introduces MemPool, an elastic memory pool managing distributed memory and KV caches across serving instances. Using MemPool APIs, MemServe combines context caching with disaggregated inference for the first time, supported by a global scheduler that enhances cache reuse through a global prompt tree-based locality-aware policy. Tests show that MemServe significantly reduces job completion time and time-to-first-token (TTFT).
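
The abstract's "global prompt tree-based locality-aware policy" can be pictured as a trie over token IDs, where each node records which instances hold KV cache for the prefix ending there. The sketch below is a toy, single-process version under that assumption; MemServe's actual data structure likely operates at cache-block granularity and is maintained globally, and its routing logic may differ.

```python
# Toy sketch of a prompt tree (trie) for locality-aware scheduling.
# Assumption: one trie node per token; names are illustrative.


class TrieNode:
    def __init__(self) -> None:
        self.children: dict[int, "TrieNode"] = {}
        self.instances: set[str] = set()  # instances caching this prefix


class PromptTree:
    def __init__(self) -> None:
        self.root = TrieNode()

    def insert(self, tokens: list[int], instance: str) -> None:
        """Record that `instance` caches KV for every prefix of `tokens`."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, TrieNode())
            node.instances.add(instance)

    def route(self, tokens: list[int]) -> tuple[str | None, int]:
        """Pick an instance with the longest cached prefix of `tokens`.

        Returns (instance, matched_length); the scheduler would send
        the request there so only the unmatched suffix is prefilled.
        """
        node, best = self.root, (None, 0)
        for depth, t in enumerate(tokens, start=1):
            node = node.children.get(t)
            if node is None or not node.instances:
                break
            best = (next(iter(node.instances)), depth)
        return best


tree = PromptTree()
tree.insert([7, 8, 9], instance="worker-1")
assert tree.route([7, 8, 9, 10]) == ("worker-1", 3)
```

Routing on the longest matched prefix is a standard locality heuristic; breaking ties arbitrarily, as done here with next(iter(...)), is a simplification, and a production scheduler would also have to weigh instance load.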