Q-GaLore: Quantized GaLore with INT4 Projection and Layer-Adaptive Low-Rank Gradients

Zhenyu Zhang, Ajay Jaiswal, Lu Yin, Shiwei Liu, Jiawei Zhao, Yuandong Tian, Zhangyang Wang

2024-07-13

Summary

This paper introduces Q-GaLore, a new method designed to make training large language models (LLMs) more memory-efficient. It combines techniques like quantization and low-rank projection to reduce the amount of memory needed without sacrificing performance.

What's the problem?

Training LLMs requires a lot of memory because these models have billions of parameters, each with associated optimizer states. Existing methods like GaLore reduce memory use by projecting gradients into a low-rank subspace, but they rely on time-consuming Singular Value Decomposition (SVD) operations and still require significant resources. GaLore also offers little accuracy or efficiency advantage over LoRA in common fine-tuning scenarios, making it hard for people with less powerful computers to train these models effectively.

What's the solution?

Q-GaLore addresses these issues with two main strategies: it quantizes the model weights to INT8 and the projection matrices to INT4 (shrinking their memory footprint), and it adaptively updates the low-rank gradient subspace based on how quickly each layer's subspace converges, avoiding most of the expensive SVD operations. Stochastic rounding lets the low-precision weights still accumulate small gradient updates. As a result, Q-GaLore can pre-train a large model (LLaMA-7B) on a GPU with only 16 GB of memory, which is a significant improvement.
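The paper does not include code here, but the layer-adaptive ("lazy") subspace update can be sketched in NumPy. In this illustrative version, `LazySubspace`, the cosine-similarity check, and the `patience` counter are stand-ins for the paper's convergence statistics: once successive projections stop changing, the layer stops recomputing the SVD.

```python
import numpy as np

def dominant_subspace(grad, rank):
    """Rank-r left singular vectors: a GaLore-style projection matrix."""
    u, _, _ = np.linalg.svd(grad, full_matrices=False)
    return u[:, :rank]

class LazySubspace:
    """Per-layer projection that stops recomputing the SVD once the
    subspace stabilizes (cosine similarity between successive projections
    stays above `threshold` for `patience` consecutive refreshes).
    This is an illustrative sketch, not the paper's exact criterion."""
    def __init__(self, rank, threshold=0.99, patience=3):
        self.rank, self.threshold, self.patience = rank, threshold, patience
        self.p = None            # current projection matrix
        self.stable_count = 0    # consecutive "unchanged" refreshes
        self.frozen = False      # once True, no further SVDs for this layer

    def project(self, grad):
        if self.p is None:
            self.p = dominant_subspace(grad, self.rank)
        elif not self.frozen:
            p_new = dominant_subspace(grad, self.rank)
            # Mean absolute cosine similarity between matching columns.
            sim = np.abs(np.sum(self.p * p_new, axis=0)).mean()
            self.stable_count = self.stable_count + 1 if sim > self.threshold else 0
            self.frozen = self.stable_count >= self.patience
            self.p = p_new
        # Low-rank gradient: shape (rank, n) instead of (m, n).
        return self.p.T @ grad

# Usage sketch: identical gradients converge immediately, so the SVD freezes.
sub = LazySubspace(rank=4)
g = np.random.default_rng(1).standard_normal((64, 32))
for _ in range(5):
    low_rank_grad = sub.project(g)
print(sub.frozen, low_rank_grad.shape)
```

Layers whose subspace converges early (which the paper observes is common) pay for only a handful of SVDs, while still-changing layers keep refreshing, which is where the training-time savings over GaLore come from.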

Why it matters?

This research is important because it makes it easier for more people to train large language models, even those with limited computing resources. By improving memory efficiency and maintaining performance, Q-GaLore can help advance AI research and applications in various fields, making powerful AI tools more accessible.

Abstract

Training Large Language Models (LLMs) is memory-intensive due to the large number of parameters and associated optimization states. GaLore, a recent method, reduces memory usage by projecting weight gradients into a low-rank subspace without compromising performance. However, GaLore relies on time-consuming Singular Value Decomposition (SVD) operations to identify the subspace, and the frequent subspace updates lead to significant training time overhead. Moreover, GaLore offers minimal improvements in accuracy and efficiency compared to LoRA in more accessible fine-tuning scenarios. To address these limitations, we introduce Q-GaLore, a novel approach that substantially reduces memory usage by combining quantization and low-rank projection, surpassing the benefits of GaLore. Our method is based on two key observations: (i) the gradient subspace exhibits diverse properties, with some layers converging early in training while others are subject to frequent changes; (ii) the projection matrices are highly resilient to low-bit quantization. Leveraging these insights, Q-GaLore adaptively updates the gradient subspace based on its convergence statistics, achieving comparable performance while significantly reducing the number of SVD operations. We maintain the projection matrices in INT4 format and weights in INT8 format, incorporating stochastic rounding to capture accumulated gradient information. This approach enables a high-precision training trajectory using only low-precision weights. We demonstrate that Q-GaLore achieves highly competitive performance with exceptional memory efficiency. At pre-training, Q-GaLore facilitates training a LLaMA-7B model from scratch on a single NVIDIA RTX 4060 Ti with only 16 GB memory. At fine-tuning, it reduces memory consumption by up to 50% compared to LoRA and GaLore, while consistently outperforming QLoRA at the same memory cost.
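The stochastic-rounding idea from the abstract can be illustrated with a minimal NumPy sketch (the `scale`, update size, and loop count below are made-up values for illustration). Deterministic round-to-nearest would discard any weight update smaller than half a quantization bin, whereas stochastic rounding preserves it in expectation, which is how low-precision weights can still follow a high-precision training trajectory.

```python
import numpy as np

rng = np.random.default_rng(0)

def stochastic_round(x):
    """Round to the nearest integer stochastically: the fractional part
    is the probability of rounding up, so E[stochastic_round(x)] = x."""
    floor = np.floor(x)
    return (floor + (rng.random(x.shape) < x - floor)).astype(np.int64)

# Keep weights as integers with a per-tensor scale, and apply updates
# smaller than one quantization bin. Round-to-nearest would drop each
# 0.03 update entirely (0.03 < scale/2); stochastic rounding keeps it
# on average, so accumulated gradient information reaches the weights.
scale = 0.1                                  # one quantization bin
w_int = np.zeros(10_000, dtype=np.int64)     # low-precision weights
for _ in range(10):
    w_int += stochastic_round(np.full(10_000, 0.03) / scale)
print(float((w_int * scale).mean()))         # close to 10 * 0.03 = 0.3
```

Averaged over many weights, the dequantized values land near the sum of the full-precision updates, even though each individual step was below the quantization resolution.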