
MobileQuant: Mobile-friendly Quantization for On-device Language Models

Fuwen Tan, Royson Lee, Łukasz Dudziak, Shell Xu Hu, Sourav Bhattacharya, Timothy Hospedales, Georgios Tzimiropoulos, Brais Martinez

2024-08-27


Summary

This paper presents MobileQuant, a new method for making large language models (LLMs) run efficiently on mobile devices by reducing the numerical precision of their weights and activations.

What's the problem?

Large language models are powerful but too demanding for mobile devices, which have limited memory, energy, and processing power. Existing techniques for shrinking these models, such as quantization, often either sacrifice accuracy or rely on operations that mobile hardware does not support efficiently, making them hard to deploy on smartphones and other edge devices.

What's the solution?

MobileQuant tackles this issue with integer-only quantization, representing both the model's weights and its activations as low-bit integers (e.g. 8-bit) that mobile hardware can process natively. It is a simple post-training method that jointly optimizes the weight transformation (which shifts hard-to-quantize outlier values from activations into weights) and the activation ranges in an end-to-end manner. This lets the quantized model keep near-full accuracy while cutting latency and energy use on mobile-friendly hardware.
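To make the "weight transformation" idea concrete, here is a minimal NumPy sketch of the underlying trick used by weight-equivalent-transformation methods: a per-channel scale `s` moves an outlier channel out of the activations and into the weights without changing the matrix product, so 8-bit quantization hurts far less. The shapes, the `alpha` balance factor, and the scale formula are illustrative assumptions, not the paper's exact procedure (MobileQuant learns these parameters end-to-end rather than setting them by formula).

```python
import numpy as np

def fake_quant(t, num_bits=8):
    # symmetric per-tensor fake quantization: round onto an integer grid,
    # then map back to floats so we can measure the induced error
    qmax = 2 ** (num_bits - 1) - 1
    scale = np.max(np.abs(t)) / qmax
    return np.round(t / scale) * scale

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 64))
x[:, 0] *= 50.0  # one outlier channel, as commonly seen in LLM activations
W = rng.normal(size=(64, 16))
ref = x @ W      # full-precision reference output

# naive 8-bit: the outlier channel forces a coarse grid on all activations
naive = fake_quant(x) @ fake_quant(W)

# equivalent transformation: a per-channel scale s shifts the outlier
# from activations into weights without changing the product, since
# x @ W == (x / s) @ (diag(s) @ W)
alpha = 0.5  # illustrative balance factor, not the paper's learned value
s = np.max(np.abs(x), axis=0) ** alpha / np.max(np.abs(W), axis=1) ** (1 - alpha)
scaled = fake_quant(x / s) @ fake_quant(s[:, None] * W)

err_naive = float(np.mean((ref - naive) ** 2))
err_scaled = float(np.mean((ref - scaled) ** 2))
```

On data like this, the transformed model quantizes with a much smaller error than the naive one, because neither the rescaled activations nor the rescaled weights contain extreme outliers.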

Why it matters?

This research is significant because it enables the deployment of advanced AI models on everyday devices, like smartphones. By making LLMs more accessible and efficient, MobileQuant can enhance various applications such as virtual assistants, language translation, and more, ultimately improving user experience without needing expensive hardware.

Abstract

Large language models (LLMs) have revolutionized language processing, delivering outstanding results across multiple applications. However, deploying LLMs on edge devices poses several challenges with respect to memory, energy, and compute costs, limiting their widespread use in devices such as mobile phones. A promising solution is to reduce the number of bits used to represent weights and activations. While existing works have found partial success at quantizing LLMs to lower bitwidths, e.g. 4-bit weights, quantizing activations below 16 bits often leads to large computational overheads due to poor on-device quantization support, or a considerable accuracy drop. Yet, 8-bit activations are very attractive for on-device deployment as they would enable LLMs to fully exploit mobile-friendly hardware, e.g. Neural Processing Units (NPUs). In this work, we make a first attempt to facilitate the on-device deployment of LLMs using integer-only quantization. We first investigate the limitations of existing quantization methods for on-device deployment, with a special focus on activation quantization. We then address these limitations by introducing a simple post-training quantization method, named MobileQuant, that extends previous weight equivalent transformation works by jointly optimizing the weight transformation and activation range parameters in an end-to-end manner. MobileQuant demonstrates superior capabilities over existing methods by 1) achieving near-lossless quantization on a wide range of LLM benchmarks, 2) reducing latency and energy consumption by 20%-50% compared to current on-device quantization strategies, 3) requiring limited compute budget, 4) being compatible with mobile-friendly compute units, e.g. NPU.
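The other half of the method the abstract names is optimizing the activation range parameters: rather than taking the raw min-max of observed activations, a clipping range is chosen to trade a little error on rare outliers for much finer resolution on the bulk of values. The sketch below illustrates that trade-off with a simple grid search over candidate clipping ranges; this search stands in for the paper's gradient-based, end-to-end learning of the range parameters, and the data, candidate grid, and `quant_error` helper are illustrative assumptions.

```python
import numpy as np

def quant_error(x, clip, num_bits=8):
    # symmetric fake quantization into [-clip, clip]; values beyond the
    # range saturate. Returns MSE against the full-precision input.
    qmax = 2 ** (num_bits - 1) - 1
    scale = clip / qmax
    q = np.clip(np.round(x / scale), -qmax, qmax) * scale
    return float(np.mean((x - q) ** 2))

rng = np.random.default_rng(0)
x = rng.normal(size=1_000_000)
x[0] = 300.0  # a single extreme outlier inflates the min-max range

# min-max calibration keeps the outlier representable but makes the
# quantization grid far too coarse for the other million values
naive_clip = float(np.max(np.abs(x)))

# search candidate ranges for the one with the lowest quantization error
candidates = np.linspace(0.1, naive_clip, 64)
best_clip = min(candidates, key=lambda c: quant_error(x, c))
```

The searched range clips the lone outlier but represents the bulk of the distribution far more finely, giving a lower overall error than min-max calibration; learning this range jointly with the weight transformation is what the abstract describes as end-to-end optimization.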