TPI-LLM: Serving 70B-scale LLMs Efficiently on Low-resource Edge Devices
Zonghang Li, Wenjiao Feng, Mohsen Guizani, Hongfang Yu
2024-10-02

Summary
This paper discusses TPI-LLM, a new system designed to efficiently run very large language models (LLMs) on low-resource edge devices, like smartphones or small computers, while keeping user data private.
What's the problem?
As people become more concerned about privacy, there's a shift from using powerful AI models in the cloud to running them on local devices. However, edge devices often have limited computing power, memory, and network bandwidth, so a single device usually cannot run a large model on its own. Existing ways of splitting the work across devices, such as pipeline parallelism, are also inefficient when only one user is issuing requests.
What's the solution?
The authors propose TPI-LLM, which uses tensor parallelism to split each layer's computation across multiple devices, so sensitive input data never leaves the user's devices. A sliding window memory scheduler loads and unloads layer weights as they are needed, overlapping disk I/O with computation and communication so that models larger than a device's memory can still run smoothly. TPI-LLM also replaces the usual allreduce with a star-based communication scheme to cut the link-latency cost of exchanging intermediate results between devices. In experiments, TPI-LLM significantly reduced time-to-first-token and per-token latency and lowered peak memory usage compared to other systems.
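To make the sliding window idea concrete, below is a minimal Python sketch, not the authors' code: only a few layers stay resident in memory at once, and a background thread prefetches the next layer's weights from disk while the current layer computes. The layer count, window size, and the load_layer/forward_layer helpers are hypothetical placeholders.

import collections
import queue
import threading

NUM_LAYERS = 80    # e.g. Llama 2-70B has 80 transformer blocks
WINDOW_SIZE = 4    # assumed number of layers kept in memory at any time

def load_layer(idx):
    # Stand-in for reading one layer's weight shard from disk.
    return {"layer": idx, "weights": f"weights_of_layer_{idx}"}

def forward_layer(weights, hidden):
    # Stand-in for the tensor-parallel forward pass of one layer
    # (real code would run attention/MLP and allreduce the partial results).
    return hidden

def prefetch_worker(out_q):
    # Disk I/O happens here, overlapped with compute on the main thread.
    for idx in range(NUM_LAYERS):
        out_q.put(load_layer(idx))  # blocks while the window is full

def run_inference(hidden):
    window = collections.deque(maxlen=WINDOW_SIZE)  # oldest layer is evicted automatically
    prefetched = queue.Queue(maxsize=WINDOW_SIZE)
    threading.Thread(target=prefetch_worker, args=(prefetched,), daemon=True).start()
    for _ in range(NUM_LAYERS):
        layer = prefetched.get()   # waits only if the prefetcher fell behind
        window.append(layer)       # keep at most WINDOW_SIZE layers resident
        hidden = forward_layer(layer["weights"], hidden)
    return hidden

print(run_inference(hidden="<token embeddings>"))

Because evicted weights are reloaded from disk for the next token, overlapping that disk I/O with computation and communication is what keeps the scheme from stalling inference.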
Why it matters?
This research is important because it enables the use of powerful AI models on everyday devices without compromising user privacy. By making large language models more accessible and efficient on low-resource devices, TPI-LLM opens up new possibilities for applications in personal assistants, smart home devices, and more, while ensuring that sensitive information stays local.
Abstract
Large model inference is shifting from cloud to edge due to concerns about the privacy of user interaction data. However, edge devices often struggle with limited computing power, memory, and bandwidth, requiring collaboration across multiple devices to run and speed up LLM inference. Pipeline parallelism, the mainstream solution, is inefficient for single-user scenarios, while tensor parallelism struggles with frequent communications. In this paper, we argue that tensor parallelism can be more effective than pipeline parallelism on low-resource devices, and present a compute- and memory-efficient tensor parallel inference system, named TPI-LLM, to serve 70B-scale models. TPI-LLM keeps sensitive raw data local in the users' devices and introduces a sliding window memory scheduler to dynamically manage layer weights during inference, with disk I/O latency overlapped with the computation and communication. This allows larger models to run smoothly on memory-limited devices. We analyze the communication bottleneck and find that link latency, not bandwidth, emerges as the main issue, so a star-based allreduce algorithm is implemented. Through extensive experiments on both emulated and real testbeds, TPI-LLM demonstrated over 80% less time-to-first-token and token latency compared to Accelerate, and over 90% compared to Transformers and Galaxy, while cutting the peak memory footprint of Llama 2-70B by 90%, requiring only 3.1 GB of memory for 70B-scale models.
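As a rough illustration of why a star topology helps when link latency dominates, here is a small Python sketch, not the paper's implementation: each device sends its partial tensor to a single hub, the hub sums them and broadcasts the result back, so one allreduce costs roughly two link latencies. The numpy arrays and the four-device example are assumptions for illustration only.

import numpy as np

def star_allreduce(partials, hub=0):
    # Step 1: every non-hub device "sends" its partial tensor to the hub (one hop).
    total = np.zeros_like(partials[hub])
    for p in partials:
        total += p
    # Step 2: the hub "broadcasts" the reduced tensor back to all devices (one hop).
    return [total.copy() for _ in partials]

# Example: four edge devices each hold a partial activation from a tensor-parallel layer.
partials = [np.full(3, rank, dtype=np.float32) for rank in range(4)]
print(star_allreduce(partials))   # every device ends up with [6., 6., 6.]

By contrast, a ring allreduce over N devices takes on the order of 2*(N-1) sequential hops, each paying the link latency, which is why it becomes latency-bound on typical home networks even when bandwidth is sufficient.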