Understand and Accelerate Memory Processing Pipeline for Disaggregated LLM Inference
Zifan He, Rui Ma, Yizhou Sun, Jason Cong
2026-04-02
Summary
This paper investigates how to make large language models, like those powering chatbots, run faster and use less energy when dealing with long pieces of text. It focuses on the parts of these models that handle memory and information retrieval.
What's the problem?
Large language models need to process a lot of information to give good answers, and this process can be slow and consume a lot of power. The paper found that a significant portion of the time and energy is spent on managing the model's 'memory' – preparing information, figuring out what's relevant, retrieving it, and then using it to generate a response. Current systems often handle all these steps on the same type of processor, which isn't always the most efficient approach.
What's the solution?
The researchers realized that different parts of the memory processing pipeline are better suited for different types of hardware. They built a system that uses both GPUs (good at complex calculations) and FPGAs (good at handling irregular and memory-focused tasks). They offloaded the memory-intensive parts, like finding relevant information, to the FPGA, while letting the GPU handle the main processing. This division of labor speeds things up and reduces energy consumption.
Why it matters?
This work shows that using a combination of different types of processors – a 'heterogeneous system' – is a promising way to improve the performance and efficiency of large language models. This is important because it could make these powerful models more accessible and sustainable, allowing them to be used in more applications without requiring massive amounts of computing power.
Abstract
Modern large language models (LLMs) increasingly depend on efficient long-context processing and generation mechanisms, including sparse attention, retrieval-augmented generation (RAG), and compressed contextual memory, to support complex reasoning. We show that these optimizations can be unified into a four-step memory processing pipeline: Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference. Through systematic profiling, we identify a 22%-97% memory processing overhead in LLM inference and strong heterogeneity in its computational characteristics. Motivated by this insight, we argue that heterogeneous systems are well-suited to accelerate memory processing and thus end-to-end inference. We demonstrate this approach on a GPU-FPGA system by offloading sparse, irregular, and memory-bound operations to FPGAs while retaining compute-intensive operations on GPUs. Evaluated on an AMD MI210 GPU and an Alveo U55C FPGA, our system is 1.04× to 2.2× faster and requires 1.11× to 4.7× less energy across multiple LLM inference optimizations than the GPU baseline (similar results hold on NVIDIA A100). These results establish heterogeneous systems as a practical direction for efficient LLM memory processing and inform future heterogeneous hardware design.
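To make the four-step pipeline named in the abstract concrete, here is a minimal sketch of how Prepare Memory, Compute Relevancy, Retrieval, and Apply to Inference might look for one simple case, top-k sparse attention over a small key/value store. It is not the authors' implementation: every function name, the dot-product scoring, and the softmax-weighted combination are assumptions chosen for illustration, and the code runs on a CPU in plain Python rather than on a GPU or FPGA.

```python
import math

def prepare_memory(keys, values):
    # Step 1 (Prepare Memory): pair each key vector with its value vector.
    # Hypothetical layout; real systems may build indexes or compress here.
    return list(zip(keys, values))

def compute_relevancy(query, memory):
    # Step 2 (Compute Relevancy): dot-product score of the query against
    # each stored key (an assumed scoring function).
    return [sum(q * k for q, k in zip(query, key)) for key, _ in memory]

def retrieve(scores, memory, top_k):
    # Step 3 (Retrieval): keep the top_k highest-scoring entries.
    # Ties keep their original order because Python's sort is stable.
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return [(scores[i], memory[i][1]) for i in order[:top_k]]

def apply_to_inference(selected):
    # Step 4 (Apply to Inference): softmax-weighted sum of retrieved values.
    peak = max(score for score, _ in selected)
    weights = [math.exp(score - peak) for score, _ in selected]
    total = sum(weights)
    dim = len(selected[0][1])
    return [
        sum(w * value[d] for w, (_, value) in zip(weights, selected)) / total
        for d in range(dim)
    ]

def memory_pipeline(query, keys, values, top_k=2):
    # Chain the four steps in the order the abstract lists them.
    memory = prepare_memory(keys, values)
    scores = compute_relevancy(query, memory)
    selected = retrieve(scores, memory, top_k)
    return apply_to_inference(selected)

if __name__ == "__main__":
    keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    values = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]
    print(memory_pipeline([1.0, 0.0], keys, values, top_k=2))
```

In this sketch the selection in step 3 is data-dependent and irregular, while step 4 is dense arithmetic. The abstract describes offloading sparse, irregular, and memory-bound operations to the FPGA and keeping compute-intensive operations on the GPU; which concrete step lands on which device in the authors' system is not specified here.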