T-MAC: CPU Renaissance via Table Lookup for Low-Bit LLM Deployment on Edge

Jianyu Wei, Shijie Cao, Ting Cao, Lingxiao Ma, Lei Wang, Yanyong Zhang, Mao Yang

2024-07-02

Summary

This paper introduces T-MAC, a new method that makes it easier and more efficient to run large language models (LLMs) on smaller devices, like smartphones and Raspberry Pi computers. It uses lookup tables to speed up the math inside these models on ordinary CPUs, without needing a lot of extra memory or processing power.

What's the problem?

As technology advances, there's a growing need to run powerful AI models on smaller devices. However, these models often require a lot of memory and processing power, which can be challenging for devices with limited resources. Current methods typically convert low-bit weights (which save space) back into high precision for calculations, which slows things down and uses more energy. This makes it hard to deploy LLMs effectively on edge devices.
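To make that overhead concrete, here is a minimal Python sketch, not code from the paper or its repository, of the conventional dequantize-then-multiply path described above: the 4-bit weights are expanded back to float32 before an ordinary matrix multiplication. The per-column scales and zero points are illustrative assumptions about the quantization layout.

```python
# A minimal sketch (not T-MAC's code) of the conventional path the paper
# criticizes: 4-bit weights are first dequantized to float32, then fed to an
# ordinary float GEMM. The scale/zero-point layout is a simplifying assumption.
import numpy as np

def dequantize_then_matmul(activations, q_weights, scales, zero_points):
    """activations: (M, K) float32
    q_weights:   (K, N) uint8 holding 4-bit values in [0, 15]
    scales, zero_points: (N,) per-output-column quantization parameters
    """
    # Step 1: expand every low-bit weight back to float32 (the extra memory
    # traffic and compute that T-MAC avoids).
    w_fp32 = (q_weights.astype(np.float32) - zero_points) * scales
    # Step 2: high-precision GEMM on the dequantized weights.
    return activations @ w_fp32

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    A = rng.standard_normal((2, 8)).astype(np.float32)
    Wq = rng.integers(0, 16, size=(8, 4), dtype=np.uint8)
    out = dequantize_then_matmul(A, Wq,
                                 scales=np.full(4, 0.1, np.float32),
                                 zero_points=np.full(4, 8, np.float32))
    print(out.shape)  # (2, 4)
```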

What's the solution?

To solve this issue, the authors developed T-MAC, which performs the necessary calculations without converting low-bit weights back to high precision. Instead of traditional multiplication, T-MAC uses lookup tables (LUTs): it precomputes small tables of partial results and then looks up answers directly, which is faster and uses less energy. The authors tested T-MAC on several models and found that it improved token generation speed by up to 4x and cut energy consumption by about 70% compared to llama.cpp, the existing baseline.
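A minimal sketch of the table-lookup idea (illustrative only, not T-MAC's actual kernel) is shown below for the simplest case of 1-bit weights in {-1, +1}: the activation vector is split into groups of four, all 16 possible signed partial sums are precomputed per group, and each weight group then costs one table lookup instead of four multiply-adds. The group size and function names are assumptions made for clarity.

```python
# Illustrative LUT-based matrix-vector product for 1-bit (+/-1) weights.
import numpy as np

G = 4  # group size; the table stays tiny (2**G entries per group)

def build_lut(x):
    """x: (K,) float activations, K divisible by G.
    Returns a (K // G, 2**G) table: entry [g, idx] is the dot product of group g
    with the +/-1 pattern encoded by the bits of idx (bit=1 -> +1, bit=0 -> -1)."""
    groups = x.reshape(-1, G)
    patterns = np.array([[1.0 if (idx >> b) & 1 else -1.0 for b in range(G)]
                         for idx in range(2 ** G)], dtype=x.dtype)  # (16, G)
    return groups @ patterns.T  # (K // G, 16)

def lut_matvec(lut, w_bits):
    """w_bits: (K, N) array of 0/1 weight bits. Returns (N,) outputs, computed
    with one table lookup per weight group instead of G multiply-adds."""
    K, N = w_bits.shape
    out = np.zeros(N, dtype=lut.dtype)
    for n in range(N):
        for g in range(K // G):
            bits = w_bits[g * G:(g + 1) * G, n]
            idx = int(np.dot(bits, 1 << np.arange(G)))  # pack G bits into a table index
            out[n] += lut[g, idx]
    return out

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(16).astype(np.float32)
    w_bits = rng.integers(0, 2, size=(16, 3)).astype(np.int8)
    ref = x @ np.where(w_bits == 1, 1.0, -1.0)   # direct +/-1 matmul
    fast = lut_matvec(build_lut(x), w_bits)       # LUT-based equivalent
    print(np.allclose(ref, fast))                 # True
```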

Why it matters?

This research is important because it enables the use of advanced AI models on devices that don't have a lot of power or memory. By making LLMs more accessible for everyday devices, T-MAC can help improve applications in areas like mobile apps, smart home devices, and other technologies that rely on AI. This advancement could lead to smarter devices that can perform complex tasks without needing powerful computers.

Abstract

The deployment of Large Language Models (LLMs) on edge devices is increasingly important to enhance on-device intelligence. Weight quantization is crucial for reducing the memory footprint of LLMs on devices. However, low-bit LLMs necessitate mixed-precision matrix multiplication (mpGEMM) of low-precision weights and high-precision activations during inference. Existing systems, lacking native support for mpGEMM, resort to dequantizing weights for high-precision computation. Such an indirect approach can lead to significant inference overhead. In this paper, we introduce T-MAC, an innovative lookup table (LUT)-based method designed for efficient low-bit LLM (i.e., weight-quantized LLM) inference on CPUs. T-MAC directly supports mpGEMM without dequantization, while simultaneously eliminating multiplications and reducing the additions required. Specifically, T-MAC transforms the traditional data-type-centric multiplication into bit-wise table lookup, and enables a unified and scalable mpGEMM solution. Our LUT-based kernels scale linearly with the weight bit-width. Evaluated on low-bit Llama and BitNet models, T-MAC demonstrates up to a 4x increase in throughput and a 70% reduction in energy consumption compared to llama.cpp. For BitNet-b1.58-3B, T-MAC delivers a token generation throughput of 30 tokens/s with a single core and 71 tokens/s with eight cores on M2-Ultra, and 11 tokens/s on lower-end devices like Raspberry Pi 5, which significantly exceeds the average adult reading speed. T-MAC, with its LUT-based computing paradigm, paves the way for the practical deployment of low-bit LLMs on resource-constrained edge devices without compromising computational efficiency. The system is open-sourced at https://github.com/microsoft/T-MAC.
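The claim that the LUT-based kernels scale linearly with weight bit-width follows from a bit-serial decomposition: a b-bit weight is a sum of b bit-planes, so the mixed-precision product reduces to b one-bit lookup passes whose results are combined with powers of two. The sketch below illustrates this under simplifying assumptions (unsigned weights, no quantization scales, a hypothetical bit_serial_matvec helper); it is not the open-sourced implementation.

```python
# Illustrative bit-serial LUT computation: cost grows linearly with the number
# of weight bits, because each bit-plane adds one 1-bit lookup pass.
import numpy as np

G = 4  # activation group size -> 2**G table entries per group

def build_lut01(x):
    """Table of group dot products with every 0/1 pattern of length G."""
    groups = x.reshape(-1, G)
    patterns = np.array([[(idx >> b) & 1 for b in range(G)]
                         for idx in range(2 ** G)], dtype=x.dtype)
    return groups @ patterns.T  # (K // G, 2**G)

def lut_pass(lut, bit_plane):
    """One 1-bit pass: bit_plane is (K, N) of 0/1 values, returns (N,) partial output."""
    K, N = bit_plane.shape
    out = np.zeros(N, dtype=lut.dtype)
    weights = 1 << np.arange(G)
    for n in range(N):
        for g in range(K // G):
            idx = int(np.dot(bit_plane[g * G:(g + 1) * G, n], weights))
            out[n] += lut[g, idx]
    return out

def bit_serial_matvec(x, q_weights, nbits):
    """x: (K,) activations; q_weights: (K, N) unsigned nbits-bit integers.
    Cost grows with nbits only through the number of 1-bit passes."""
    lut = build_lut01(x)                       # built once, shared by all passes
    acc = np.zeros(q_weights.shape[1], dtype=x.dtype)
    for i in range(nbits):                     # one LUT pass per weight bit
        plane = (q_weights >> i) & 1
        acc += (1 << i) * lut_pass(lut, plane)
    return acc

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal(16).astype(np.float32)
    W = rng.integers(0, 16, size=(16, 3), dtype=np.uint8)   # 4-bit weights
    print(np.allclose(x @ W.astype(np.float32),
                      bit_serial_matvec(x, W, nbits=4)))    # True
```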