
Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models

Dongwon Jo, Taesu Kim, Yulhwa Kim, Jae-Joon Kim

2024-06-19


Summary

This paper introduces a new method called Mixture of Scales (BinaryMoS) for compressing large language models (LLMs). It builds on binarization, a technique that converts a model's weight parameters into binary values, shrinking its memory footprint while aiming to lose as little performance as possible.

What's the problem?

While binarization is effective for reducing the size of LLMs, traditional methods noticeably degrade the model's ability to understand and generate language. The models become smaller and easier to deploy, but they perform worse on tasks like text generation and comprehension. As a result, there is a need for binarization techniques that preserve the linguistic capability of these models.

What's the solution?

To solve this problem, the authors developed BinaryMoS, which maintains multiple scaling experts for the binary weights. Instead of applying one fixed set of scaling factors to every token, BinaryMoS dynamically merges these experts for each token, so the effective scales adapt to what the model is processing at any given moment. Because only the scaling factors are adapted, not the entire weight matrix, the method keeps the compression efficiency of static binarization while improving representational power. In experiments, BinaryMoS outperformed traditional binarization methods and even some 2-bit quantization techniques at a similar model size. A minimal code sketch of this idea follows.
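The sketch below illustrates the token-adaptive scaling idea in a hedged, simplified form: a binary-weight linear layer whose per-channel scaling factors are produced by mixing several "scaling experts" per token via a small router. The class name `BinaryMoSLinear`, the router design, the number of experts, and all shapes are illustrative assumptions, not the authors' implementation, and training details (e.g., a straight-through estimator for the sign function) are omitted.

```python
# Illustrative sketch only: names, shapes, and the router are assumptions,
# not the paper's actual code.
import torch
import torch.nn as nn


class BinaryMoSLinear(nn.Module):
    """Binary-weight linear layer with token-adaptive, mixture-of-experts scaling."""

    def __init__(self, in_features: int, out_features: int, num_experts: int = 4):
        super().__init__()
        # Latent full-precision weights; their sign gives the {-1, +1} binary matrix.
        self.latent_weight = nn.Parameter(torch.randn(out_features, in_features) * 0.02)
        # Each expert holds one scaling factor per output channel.
        self.scale_experts = nn.Parameter(torch.ones(num_experts, out_features))
        # Lightweight router mapping a token's hidden state to expert mixing weights.
        self.router = nn.Linear(in_features, num_experts)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, in_features)
        binary_weight = torch.sign(self.latent_weight)        # (out, in), values in {-1, +1}
        gate = torch.softmax(self.router(x), dim=-1)          # (batch, seq, num_experts)
        # Token-adaptive scales: per-token mixture of the experts' channel scales.
        scales = gate @ self.scale_experts                    # (batch, seq, out_features)
        # Binary matmul, then per-token, per-channel rescaling.
        return (x @ binary_weight.t()) * scales


# Toy usage: a dense projection replaced by the binarized, token-adaptive layer.
layer = BinaryMoSLinear(in_features=64, out_features=128, num_experts=4)
tokens = torch.randn(2, 10, 64)   # (batch, seq_len, hidden)
out = layer(tokens)               # shape: (2, 10, 128)
print(out.shape)
```

Note the memory argument: the binary weight matrix dominates storage, while the extra experts add only `num_experts * out_features` scale values and a tiny router, which is why the adaptive scheme stays close to static binarization in size.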

Why it matters?

This research is important because it offers a way to make large language models more efficient without sacrificing their ability to understand and generate language. By improving how these models are compressed, BinaryMoS could enable their use in devices with limited resources, making advanced AI technology more accessible. This advancement has significant implications for various applications, including chatbots, translation services, and other tools that rely on natural language processing.

Abstract

Binarization, which converts weight parameters to binary values, has emerged as an effective strategy to reduce the size of large language models (LLMs). However, typical binarization techniques significantly diminish linguistic effectiveness of LLMs. To address this issue, we introduce a novel binarization technique called Mixture of Scales (BinaryMoS). Unlike conventional methods, BinaryMoS employs multiple scaling experts for binary weights, dynamically merging these experts for each token to adaptively generate scaling factors. This token-adaptive approach boosts the representational power of binarized LLMs by enabling contextual adjustments to the values of binary weights. Moreover, because this adaptive process only involves the scaling factors rather than the entire weight matrix, BinaryMoS maintains compression efficiency similar to traditional static binarization methods. Our experimental results reveal that BinaryMoS surpasses conventional binarization techniques in various natural language processing tasks and even outperforms 2-bit quantization methods, all while maintaining similar model size to static binarization techniques.