
PowerInfer-2: Fast Large Language Model Inference on a Smartphone

Zhenliang Xue, Yixin Song, Zeyu Mi, Le Chen, Yubin Xia, Haibo Chen

2024-06-13


Summary

This paper presents PowerInfer-2, a new framework that lets smartphones run large language models (LLMs) quickly, even models too big to fit in the phone's memory. It does this by splitting the model's computations into small pieces and loading only the weights it actually needs from storage.

What's the problem?

Running large language models on smartphones is challenging because these models are often larger than the device's memory. When a model does not fit, its weights have to be read from the phone's much slower flash storage during inference, and existing inference frameworks handle this poorly, making generation too slow for modern AI applications.

What's the solution?

PowerInfer-2 solves this problem by breaking the large matrix computations of an LLM into smaller tasks called neuron cluster computations. A polymorphic neuron engine then adapts its computation strategy to the different stages of inference, and features like segmented neuron caching and fine-grained pipelining reduce and hide the delays caused by loading weights from storage. As a result, PowerInfer-2 supports a wide range of large models and runs up to 29.2 times faster than existing smartphone inference systems.
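To make the idea more concrete, here is a minimal, hypothetical sketch of neuron-cluster computation with a small in-memory weight cache. This is not PowerInfer-2's actual code: the names (`NeuronCache`, `load_cluster_from_flash`), the cluster size, and the simple FIFO eviction policy are all illustrative assumptions; the point is only that work is done per cluster of neurons, and clusters not in memory are fetched from storage on demand.

```python
import numpy as np

# Illustrative sizes; the real system works on quantized LLM weight matrices.
HIDDEN = 64          # input dimension
NUM_NEURONS = 256    # rows of the FFN weight matrix
CLUSTER_SIZE = 8     # neurons grouped into one cluster

def load_cluster_from_flash(cluster_id):
    """Stand-in for reading one cluster's weight rows from flash storage."""
    rng = np.random.default_rng(cluster_id)
    return rng.standard_normal((CLUSTER_SIZE, HIDDEN))

class NeuronCache:
    """Keeps a bounded number of neuron clusters in memory (naive FIFO eviction)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.store = {}

    def get(self, cluster_id):
        if cluster_id not in self.store:
            if len(self.store) >= self.capacity:
                self.store.pop(next(iter(self.store)))   # evict oldest cluster
            self.store[cluster_id] = load_cluster_from_flash(cluster_id)
        return self.store[cluster_id]

def sparse_matvec(x, active_neurons, cache):
    """Compute outputs only for neurons predicted to be active."""
    out = np.zeros(NUM_NEURONS)
    by_cluster = {}
    for n in active_neurons:
        by_cluster.setdefault(n // CLUSTER_SIZE, []).append(n)
    for cid, neurons in by_cluster.items():
        rows = cache.get(cid)                 # served from cache or loaded from flash
        for n in neurons:
            out[n] = rows[n % CLUSTER_SIZE] @ x
    return out

cache = NeuronCache(capacity=16)
x = np.random.randn(HIDDEN)
active = [3, 4, 42, 200, 201]                 # e.g., produced by an activation predictor
y = sparse_matvec(x, active, cache)
```

Because only the active clusters are touched, most of the weight matrix never needs to be in memory at once, which is what makes models larger than RAM feasible.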

Why it matters?

This research is important because it enables smartphones to run advanced AI models more efficiently, opening up new possibilities for applications like real-time language translation, smart assistants, and more. By making it easier to use large models on mobile devices, PowerInfer-2 could enhance user experiences and expand the capabilities of smartphones.

Abstract

This paper introduces PowerInfer-2, a framework designed for high-speed inference of Large Language Models (LLMs) on smartphones, particularly effective for models whose sizes exceed the device's memory capacity. The key insight of PowerInfer-2 is to utilize the heterogeneous computation, memory, and I/O resources in smartphones by decomposing traditional matrix computations into fine-grained neuron cluster computations. Specifically, PowerInfer-2 features a polymorphic neuron engine that adapts computational strategies for various stages of LLM inference. Additionally, it introduces segmented neuron caching and fine-grained neuron-cluster-level pipelining, which effectively minimize and conceal the overhead caused by I/O operations. The implementation and evaluation of PowerInfer-2 demonstrate its capability to support a wide array of LLM models on two smartphones, achieving up to a 29.2x speed increase compared with state-of-the-art frameworks. Notably, PowerInfer-2 is the first system to serve the TurboSparse-Mixtral-47B model with a generation rate of 11.68 tokens per second on a smartphone. For models that fit entirely within the memory, PowerInfer-2 can achieve approximately a 40% reduction in memory usage while maintaining inference speeds comparable to llama.cpp and MLC-LLM. For more details, including a demonstration video, please visit the project site at www.powerinfer.ai/v2.
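The abstract's second trick, fine-grained neuron-cluster-level pipelining, can also be illustrated with a small hedged sketch: a background loader fetches the next clusters from storage while the current cluster is being multiplied, so I/O latency is hidden behind computation. The function names and the queue-based design below are assumptions for illustration, not the paper's implementation.

```python
import queue
import threading
import numpy as np

HIDDEN = 64
CLUSTER_SIZE = 8

def load_cluster_from_flash(cluster_id):
    """Stand-in for a flash read; in practice this is the slow I/O step."""
    rng = np.random.default_rng(cluster_id)
    return rng.standard_normal((CLUSTER_SIZE, HIDDEN))

def pipelined_matvec(x, cluster_ids, prefetch_depth=4):
    """Overlap flash I/O with computation at neuron-cluster granularity."""
    ready = queue.Queue(maxsize=prefetch_depth)

    def loader():
        for cid in cluster_ids:
            ready.put((cid, load_cluster_from_flash(cid)))  # I/O runs ahead of compute
        ready.put(None)                                      # end-of-stream marker

    threading.Thread(target=loader, daemon=True).start()

    partial = []
    while True:
        item = ready.get()
        if item is None:
            break
        cid, rows = item
        partial.append(rows @ x)   # compute while the loader fetches the next cluster
    return np.concatenate(partial)

x = np.random.randn(HIDDEN)
y = pipelined_matvec(x, cluster_ids=[0, 1, 2, 3, 4, 5])
```

The bounded queue keeps only a few clusters in flight at a time, which mirrors the goal of hiding I/O overhead without inflating memory use.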