BitNet a4.8: 4-bit Activations for 1-bit LLMs

Hongyu Wang, Shuming Ma, Furu Wei

2024-11-08

Summary

This paper introduces BitNet a4.8, a model that equips 1-bit large language models (LLMs) with 4-bit activations, preserving their accuracy while making inference even more efficient.

What's the problem?

Large language models are powerful but expensive to run. 1-bit LLMs such as BitNet b1.58 cut inference cost by using extremely low-bit weights, yet their activations still have to be kept at higher precision, because a few outlier channels introduce large quantization errors when activations are pushed down to 4 bits. A better way is needed to lower activation precision without giving up accuracy.
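To see why outlier channels matter, here is a toy Python illustration (not from the paper): with a single per-tensor absmax scale, one large outlier forces a coarse quantization step, and the ordinary activation values collapse onto almost no levels.

```python
# Toy illustration of outlier channels hurting low-bit activation quantization.
# With symmetric absmax INT4 quantization, a single large value forces a coarse
# step size, so typical activations are rounded to (nearly) zero.
import numpy as np

def absmax_quantize(x, bits):
    """Symmetric absmax fake-quantization (quantize, then dequantize)."""
    qmax = 2 ** (bits - 1) - 1           # e.g. 7 for INT4
    scale = np.abs(x).max() / qmax        # one scale for the whole tensor
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

acts = np.array([0.3, -0.5, 0.8, 0.1, -0.2, 0.4])
with_outlier = np.append(acts, 40.0)      # one outlier channel

print(absmax_quantize(acts, bits=4))          # close to the original values
print(absmax_quantize(with_outlier, bits=4))  # small activations crushed to ~0
```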

What's the solution?

BitNet a4.8 combines hybrid quantization with sparsification. It uses 4-bit activations for the inputs to the attention and feed-forward network layers, and it sparsifies the intermediate states before quantizing them to 8 bits, which sidesteps the outlier channels that would otherwise cause large quantization errors. Experiments show that BitNet a4.8 matches the performance of BitNet b1.58 at equivalent training cost while running faster at inference, since it can use 4-bit (INT4/FP4) kernels; it also activates only 55% of its parameters and supports a 3-bit KV cache.
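Below is a minimal sketch of such a hybrid scheme, not the authors' implementation: absmax fake-quantization to 4 bits for layer inputs, and magnitude-based sparsification followed by 8-bit quantization for intermediate states. The keep ratio and per-tensor scaling granularity are assumptions for illustration.

```python
# Illustrative sketch of the hybrid quantization + sparsification strategy:
# 4-bit activations for attention/FFN inputs, and sparsify-then-8-bit for
# intermediate states. Thresholds, granularity, and kernels are assumptions.
import torch

def absmax_quantize(x: torch.Tensor, bits: int) -> torch.Tensor:
    """Symmetric absmax fake-quantization (quantize, then dequantize)."""
    qmax = 2 ** (bits - 1) - 1
    scale = x.abs().max().clamp(min=1e-8) / qmax
    return torch.clamp((x / scale).round(), -qmax - 1, qmax) * scale

def quantize_layer_input(x: torch.Tensor) -> torch.Tensor:
    # 4-bit activations for inputs to attention / feed-forward layers.
    return absmax_quantize(x, bits=4)

def sparsify_then_quantize(h: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    # Keep only the largest-magnitude entries (illustrative ratio), then
    # quantize the survivors to 8 bits; everything else becomes zero.
    k = max(1, int(keep_ratio * h.numel()))
    threshold = h.abs().flatten().kthvalue(h.numel() - k + 1).values
    mask = h.abs() >= threshold
    return absmax_quantize(h * mask, bits=8)

x = torch.randn(4, 16)            # layer input
h = torch.randn(4, 64)            # intermediate state (e.g. FFN hidden)
x_q = quantize_layer_input(x)     # INT4-style activations
h_q = sparsify_then_quantize(h)   # sparse + INT8-style intermediate state
```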

Why it matters?

This research matters because it shows that LLMs can be made much cheaper to run without sacrificing quality. More efficient models can be deployed in more settings, including on devices with limited resources, which could broaden access to AI technology.

Abstract

Recent research on 1-bit Large Language Models (LLMs), such as BitNet b1.58, presents a promising direction for reducing the inference cost of LLMs while maintaining their performance. In this work, we introduce BitNet a4.8, enabling 4-bit activations for 1-bit LLMs. BitNet a4.8 employs a hybrid quantization and sparsification strategy to mitigate the quantization errors introduced by the outlier channels. Specifically, we utilize 4-bit activations for inputs to the attention and feed-forward network layers, while sparsifying intermediate states followed by 8-bit quantization. Extensive experiments demonstrate that BitNet a4.8 achieves performance comparable to BitNet b1.58 with equivalent training costs, while being faster in inference by enabling 4-bit (INT4/FP4) kernels. Additionally, BitNet a4.8 activates only 55% of parameters and supports a 3-bit KV cache, further enhancing the efficiency of large-scale LLM deployment and inference.
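As a rough illustration of the 3-bit KV cache mentioned in the abstract, here is a hedged sketch that stores keys/values as signed 3-bit codes with a per-head absmax scale and dequantizes them on read; the paper's actual grouping and scaling scheme may differ.

```python
# Illustrative sketch (not the paper's implementation) of a 3-bit KV cache:
# keys/values are stored as signed 3-bit integer codes with a per-head absmax
# scale and dequantized on read. Per-head grouping is an assumption.
import torch

def quantize_kv(kv: torch.Tensor, bits: int = 3):
    """kv: [heads, seq, head_dim]. Returns integer codes plus per-head scales."""
    qmax = 2 ** (bits - 1) - 1                          # 3 for 3-bit
    scale = (kv.abs().amax(dim=(-2, -1), keepdim=True) / qmax).clamp(min=1e-8)
    codes = torch.clamp((kv / scale).round(), -qmax - 1, qmax).to(torch.int8)
    return codes, scale

def dequantize_kv(codes: torch.Tensor, scale: torch.Tensor) -> torch.Tensor:
    return codes.to(torch.float32) * scale

keys = torch.randn(8, 128, 64)                          # [heads, seq, head_dim]
codes, scale = quantize_kv(keys)
keys_approx = dequantize_kv(codes, scale)
print((keys - keys_approx).abs().mean())                # quantization error
```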