Fast Matrix Multiplications for Lookup Table-Quantized LLMs
Han Guo, William Brandon, Radostin Cholakov, Jonathan Ragan-Kelley, Eric P. Xing, Yoon Kim
2024-07-22

Summary
This paper introduces FLUTE, a new engine designed to speed up matrix multiplications for large language models (LLMs) that use lookup table quantization. This form of compression makes LLMs more efficient by reducing the amount of weight data that must be moved through GPU memory during inference.
What's the problem?
When running large language models, transferring the model's weights from the GPU's global memory to its compute units is often the main bottleneck. Quantizing the weights shrinks this traffic, but writing fast GPU kernels becomes difficult when the weights are compressed to bit widths that do not divide evenly into bytes (for example, 3 bits) and when the quantization is non-uniform, relying on lookup tables. Handled naively, these irregularities slow the model down, particularly during text generation, where the weights are re-read for every new token.
What's the solution?
The authors developed FLUTE, which restructures the quantized weight matrix offline so that it can be unpacked at runtime with minimal bit manipulation, and vectorizes and duplicates the lookup table to relieve pressure on shared memory bandwidth. With these changes, the fused dequantize-and-multiply kernel runs 2-4 times faster than existing kernels at the small batch sizes typical of LLM inference. As an application, the authors quantize LLaMA3 with an extension of lookup table-based NormalFloat quantization and serve it with FLUTE, obtaining a 1.5-2x end-to-end throughput increase while keeping quality competitive with strong baselines. A conceptual sketch of what such a fused kernel computes is shown below.
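The NumPy sketch below illustrates, at a high level, what a fused lookup table dequantization plus matmul kernel computes: small integer codes are expanded through a lookup table, rescaled per quantization group, and multiplied with the activations. All names here (lut, idx, scales, group_size) and the row-major layout are illustrative assumptions for exposition, not FLUTE's actual weight layout, API, or kernel code.

```python
# Conceptual sketch of LUT dequantization fused with a matmul.
# Illustrative only: not FLUTE's actual layout or API.
import numpy as np

def lut_dequant_matmul(x, idx, lut, scales, group_size=128):
    """x: (batch, in_features) activations.
    idx: (in_features, out_features) integer codes (e.g., 0..7 for 3-bit weights).
    lut: (2**bits,) lookup table of codebook values.
    scales: (in_features // group_size, out_features) per-group scales.
    """
    # Expand each code through the lookup table, then rescale per group.
    w = lut[idx].astype(np.float32)                 # (in_features, out_features)
    w = w * np.repeat(scales, group_size, axis=0)   # broadcast group scales
    # A real kernel fuses this dequantization with the matmul so that the
    # packed codes, not full-precision weights, are what move through memory.
    return x @ w

# Tiny usage example with a 3-bit (8-entry) table.
rng = np.random.default_rng(0)
bits, group_size, in_f, out_f = 3, 128, 256, 64
lut = np.sort(rng.standard_normal(2**bits)).astype(np.float32)
idx = rng.integers(0, 2**bits, size=(in_f, out_f))
scales = rng.random((in_f // group_size, out_f)).astype(np.float32)
x = rng.standard_normal((4, in_f)).astype(np.float32)
print(lut_dequant_matmul(x, idx, lut, scales, group_size).shape)  # (4, 64)
```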
Why it matters?
This research is important because it makes quantized large language models faster and more practical to deploy: compressing weights only pays off if the kernels that consume the compressed weights are fast. By making lookup table-quantized matrix multiplications efficient, FLUTE makes memory-saving quantization schemes viable for real-world serving, ultimately leading to faster and less expensive LLM-based applications.
Abstract
The deployment of large language models (LLMs) is often constrained by memory bandwidth, where the primary bottleneck is the cost of transferring model parameters from the GPU's global memory to its registers. When coupled with custom kernels that fuse the dequantization and matmul operations, weight-only quantization can thus enable faster inference by reducing the amount of memory movement. However, developing high-performance kernels for weight-quantized LLMs presents substantial challenges, especially when the weights are compressed to non-evenly-divisible bit widths (e.g., 3 bits) with non-uniform, lookup table (LUT) quantization. This paper describes FLUTE, a flexible lookup table engine for LUT-quantized LLMs, which uses offline restructuring of the quantized weight matrix to minimize bit manipulations associated with unpacking, and vectorization and duplication of the lookup table to mitigate shared memory bandwidth constraints. At batch sizes < 32 and quantization group size of 128 (typical in LLM inference), the FLUTE kernel can be 2-4x faster than existing GEMM kernels. As an application of FLUTE, we explore a simple extension to lookup table-based NormalFloat quantization and apply it to quantize LLaMA3 to various configurations, obtaining competitive quantization performance against strong baselines while obtaining an end-to-end throughput increase of 1.5 to 2 times.
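To make the lookup table idea concrete, the sketch below builds a small NormalFloat-style codebook from quantiles of a standard normal distribution and quantizes one weight group to its nearest codebook entries with absmax scaling. This is a simplified illustration of the general approach, not the paper's specific extension; the function names and the use of scipy.stats.norm are assumptions for the example.

```python
# Simplified NormalFloat-style lookup table quantization of one weight group.
# Illustrative only; not the paper's exact quantization scheme.
import numpy as np
from scipy.stats import norm

def make_normalfloat_lut(bits=3):
    # Interior quantiles of N(0, 1), rescaled so the codebook spans [-1, 1].
    probs = np.linspace(0, 1, 2**bits + 2)[1:-1]
    lut = norm.ppf(probs)
    return (lut / np.abs(lut).max()).astype(np.float32)

def quantize_group(weights, lut):
    # Absmax scaling: map the group's largest magnitude to +/-1,
    # then pick the nearest lookup table entry for each weight.
    scale = np.abs(weights).max() + 1e-12
    idx = np.abs(weights[:, None] / scale - lut[None, :]).argmin(axis=1)
    return idx.astype(np.uint8), np.float32(scale)

def dequantize_group(idx, scale, lut):
    return lut[idx] * scale

# Round-trip a single 128-element quantization group.
lut = make_normalfloat_lut(bits=3)
w = np.random.default_rng(0).standard_normal(128).astype(np.float32)
idx, scale = quantize_group(w, lut)
w_hat = dequantize_group(idx, scale, lut)
print(np.abs(w - w_hat).mean())  # mean reconstruction error
```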