Inference Performance Optimization for Large Language Models on CPUs

Pujiang He, Shan Zhou, Wenhuan Huang, Changqing Li, Duyi Wang, Bin Guo, Chen Meng, Sheng Gui, Weifei Yu, Yi Xie

2024-07-11

Summary

This paper presents a method for improving how large language models (LLMs) perform when running on CPUs, especially in settings where computing resources are limited. It focuses on making these models faster and more efficient without requiring expensive hardware such as GPUs.

What's the problem?

The main problem is that LLMs are usually very powerful but also very resource-intensive, which means they often need high-end hardware like GPUs to run effectively. However, not everyone has access to this kind of hardware, and using it can be expensive. This creates a need for ways to optimize LLMs so they can run well on more common hardware like CPUs.

What's the solution?

To solve this issue, the authors developed a set of techniques for optimizing LLM inference on CPUs. They reduce the size of the KV cache, which lowers memory use while preserving the accuracy of the model's outputs, and they introduce a distributed inference approach, built on the oneAPI Collective Communications Library, that lets multiple CPUs work together efficiently. They also apply tailored optimizations to the most commonly used LLMs and have open-sourced their code.
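One of the core ideas, shrinking the KV cache, can be illustrated with a small example. The sketch below is not the paper's implementation (the actual optimizations live in the xFasterTransformer code and target Intel CPUs); it only shows, under the assumption of simple int8 quantization with per-token scales, how a cached key/value block can be stored in roughly a quarter of the memory while staying close to the original values.

```python
# Minimal sketch (an assumption, not the paper's code): shrinking a KV cache
# by storing keys/values as int8 with a per-token scale instead of float32.
import numpy as np

def quantize_kv(kv_fp32):
    """Quantize a (tokens, head_dim) float32 KV block to int8 plus per-token scales."""
    scales = np.abs(kv_fp32).max(axis=-1, keepdims=True) / 127.0  # per-token scale
    scales[scales == 0] = 1.0                                     # avoid divide-by-zero
    kv_int8 = np.round(kv_fp32 / scales).astype(np.int8)
    return kv_int8, scales.astype(np.float32)

def dequantize_kv(kv_int8, scales):
    """Recover an approximate float32 KV block for use in attention."""
    return kv_int8.astype(np.float32) * scales

# Example: a cache for 512 tokens with head dimension 128.
kv = np.random.randn(512, 128).astype(np.float32)
kv_q, s = quantize_kv(kv)
kv_approx = dequantize_kv(kv_q, s)

print("fp32 bytes:", kv.nbytes)               # 512 * 128 * 4 bytes
print("int8 bytes:", kv_q.nbytes + s.nbytes)  # roughly 4x smaller
print("max abs error:", np.abs(kv - kv_approx).max())
```

The memory saved this way matters because, during generation, the KV cache grows with every token and often dominates CPU memory use; keeping it small is what allows longer outputs on modest hardware.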

Why it matters?

This research is important because it makes it easier and cheaper for people and organizations to use advanced language models without needing expensive GPUs. By optimizing LLMs for CPUs, more users can access these powerful tools, which can lead to wider applications in areas like education, healthcare, and business where AI can provide valuable insights and assistance.

Abstract

Large language models (LLMs) have shown exceptional performance and vast potential across diverse tasks. However, the deployment of LLMs with high performance in low-resource environments has garnered significant attention in the industry. When GPU hardware resources are limited, we can explore alternative options on CPUs. To mitigate the financial burden and alleviate constraints imposed by hardware resources, optimizing inference performance is necessary. In this paper, we introduce an easily deployable inference performance optimization solution aimed at accelerating LLMs on CPUs. In this solution, we implement an effective way to reduce the KV cache size while ensuring precision. We propose a distributed inference optimization approach and implement it based on oneAPI Collective Communications Library. Furthermore, we propose optimization approaches for LLMs on CPU, and conduct tailored optimizations for the most commonly used models. The code is open-sourced at https://github.com/intel/xFasterTransformer.
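For readers curious what "distributed inference based on oneAPI Collective Communications Library" refers to in practice, a common pattern is to shard a layer's weights across CPU workers and combine their partial results with an all-reduce. The snippet below is only a conceptual, in-process simulation of that pattern in NumPy; it is an assumption about the general technique, not the paper's oneCCL-based code (which lives in the linked xFasterTransformer repository).

```python
# Conceptual sketch (an assumption, not the paper's implementation): tensor-parallel
# inference splits a weight matrix across workers; each computes a partial result
# and an all-reduce sums them. Here the workers are simulated in a single process.
import numpy as np

n_workers, d_in, d_out = 4, 256, 256
x = np.random.randn(1, d_in).astype(np.float32)      # one token's activation
w = np.random.randn(d_in, d_out).astype(np.float32)  # full weight matrix

# Each worker holds a slice of the rows of w and sees the matching slice of x.
w_shards = np.split(w, n_workers, axis=0)
x_shards = np.split(x, n_workers, axis=1)

# Local partial matmuls, one per worker.
partials = [xs @ ws for xs, ws in zip(x_shards, w_shards)]

# "All-reduce": summing the partial results reproduces the full matmul.
y_allreduce = sum(partials)
y_reference = x @ w
print("max difference:", np.abs(y_allreduce - y_reference).max())
```

Summing the partials is exactly what an all-reduce does; in a real deployment each worker runs on a different CPU socket or node, and the communication library carries out the sum across them.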