Turbo Sparse: Achieving LLM SOTA Performance with Minimal Activated Parameters
Yixin Song, Haotong Xie, Zhengyan Zhang, Bo Wen, Li Ma, Zeyu Mi, Haibo Chen
2024-06-13
Summary
This paper introduces Turbo Sparse, a method that lets large language models (LLMs) run faster and more efficiently by activating only a small fraction of their parameters per inference step while still maintaining high performance.
What's the problem?
Large language models are powerful AI systems that can understand and generate text, but they require enormous computational resources because they have billions of parameters. One way to cut inference cost is activation sparsity: for any given input, only some neurons produce nonzero outputs, so the computation for the rest can be skipped. However, the activation functions used in most modern LLMs, such as SwiGLU and GeGLU, produce very little sparsity, and simply swapping in ReLU does not create enough sparsity either. On top of that, retraining a model for sparsity with inadequate data risks degrading its quality. Together, these problems make it hard to run LLMs efficiently, especially on devices with limited resources.
What's the solution?
To solve this problem, the authors developed a new activation function called dReLU, which makes LLM activations much sparser: only a small fraction of the network's neurons fire for any given input, so far fewer parameters need to be computed. They paired dReLU with a carefully chosen mixture of high-quality training data so that the sparsified models keep their accuracy. They also exploited the sparse activation patterns inside the feed-forward network (FFN) experts of Mixture-of-Experts (MoE) models to boost efficiency further. Applied to the Mistral and Mixtral models, these techniques yield models that activate only 2.5 billion and 4.3 billion parameters per inference step, respectively, while performing at least as well as the originals. The resulting 2-5x decoding speedup lets their TurboSparse-Mixtral-47B model generate 11 tokens per second on a mobile phone.
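The intuition behind dReLU can be seen in a toy gated feed-forward layer. The sketch below (NumPy, with made-up weight shapes; not the authors' code) contrasts a standard SwiGLU branch with a dReLU branch. As the paper describes, dReLU applies ReLU to both the gate and up projections, so an intermediate neuron's output is exactly zero unless both projections are positive, while SwiGLU's smooth gating produces essentially no exact zeros:

```python
import numpy as np

rng = np.random.default_rng(0)

def silu(x):
    """SiLU (swish) activation used in SwiGLU."""
    return x / (1.0 + np.exp(-x))

def sparsity(a):
    """Fraction of entries that are exactly zero."""
    return float((a == 0).mean())

# Toy gated FFN weights (hidden=8, intermediate=32); real models are far larger.
W_gate = rng.standard_normal((8, 32))
W_up = rng.standard_normal((8, 32))
x = rng.standard_normal((1, 8))

# SwiGLU: SiLU(x W_gate) * (x W_up) -- smooth gate, almost never exactly zero.
swiglu = silu(x @ W_gate) * (x @ W_up)

# dReLU (as described in the paper): ReLU on BOTH branches, so a neuron
# is skippable whenever either projection is non-positive.
drelu = np.maximum(x @ W_gate, 0.0) * np.maximum(x @ W_up, 0.0)

print(f"SwiGLU zero fraction: {sparsity(swiglu):.2f}")
print(f"dReLU  zero fraction: {sparsity(drelu):.2f}")
```

At inference time, every zeroed intermediate neuron means its rows in the gate, up, and down projection matrices never need to be read or multiplied, which is where the decoding speedup comes from.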
Why it matters?
This research is important because it allows large language models to run more efficiently, making it feasible to use them on devices with limited computing power, such as smartphones. By reducing the number of activated parameters without sacrificing performance, Turbo Sparse helps lower the energy and cost associated with running these advanced AI systems, paving the way for more accessible applications in everyday technology.
Abstract
Exploiting activation sparsity is a promising approach to significantly accelerating the inference process of large language models (LLMs) without compromising performance. However, activation sparsity is determined by activation functions, and commonly used ones like SwiGLU and GeGLU exhibit limited sparsity. Simply replacing these functions with ReLU fails to achieve sufficient sparsity. Moreover, inadequate training data can further increase the risk of performance degradation. To address these challenges, we propose a novel dReLU function, which is designed to improve LLM activation sparsity, along with a high-quality training data mixture ratio to facilitate effective sparsification. Additionally, we leverage sparse activation patterns within the Feed-Forward Network (FFN) experts of Mixture-of-Experts (MoE) models to further boost efficiency. By applying our neuron sparsification method to the Mistral and Mixtral models, only 2.5 billion and 4.3 billion parameters are activated per inference iteration, respectively, while achieving even more powerful model performance. Evaluation results demonstrate that this sparsity achieves a 2-5x decoding speedup. Remarkably, on mobile phones, our TurboSparse-Mixtral-47B achieves an inference speed of 11 tokens per second. Our models are available at https://huggingface.co/PowerInfer