From GaLore to WeLore: How Low-Rank Weights Non-uniformly Emerge from Low-Rank Gradients
Ajay Jaiswal, Lu Yin, Zhenyu Zhang, Shiwei Liu, Jiawei Zhao, Yuandong Tian, Zhangyang Wang
2024-07-17

Summary
This paper introduces WeLore, a method that exploits low-rank structure in the weight matrices of large language models (LLMs) to compress them and make fine-tuning more memory-efficient.
What's the problem?
Large language models are built from huge weight matrices with billions of elements, which demand substantial memory and compute. Expressing these matrices in low-rank form can relax those requirements, but deciding how aggressively to reduce the rank of each matrix is difficult: cutting too much degrades performance, and previous methods typically applied a uniform rank reduction across all layers, ignoring that some layers are far more compressible than others.
What's the solution?
WeLore studies how low-rank structure emerges in the weight matrices of different layers during training, and uses the shape of each matrix's singular-value spectrum to decide how much it can be compressed. It categorizes weight matrices into two types: Low-rank Components (LRCs), which can be compressed efficiently, and Non-Low-rank Components (N-LRCs), which cannot. By fine-tuning only the LRCs in their low-rank form, WeLore achieves significant reductions in memory usage while matching or even improving on full fine-tuning. The procedure is data-agnostic and one-shot, so it can be applied quickly; a simplified sketch of the LRC/N-LRC split follows below.
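The LRC/N-LRC split can be pictured as a test on each matrix's singular-value spectrum: matrices whose spectrum decays quickly (heavy-tailed) are compressible, the rest are not. The PyTorch sketch below is only an illustration of that idea, not the paper's implementation; `energy_threshold` and `max_rank_fraction` are made-up knobs, whereas the actual WeLore criterion is derived from the heavy-tail shape of the spectrum.

```python
import torch

def classify_matrix(weight: torch.Tensor,
                    energy_threshold: float = 0.90,
                    max_rank_fraction: float = 0.50):
    """Return ("LRC", rank) if `weight` is well approximated at low rank,
    otherwise ("N-LRC", full_rank)."""
    # Singular values come back in descending order.
    singular_values = torch.linalg.svdvals(weight.float())
    energy = singular_values.pow(2)
    cumulative = torch.cumsum(energy, dim=0) / energy.sum()
    # Smallest rank whose leading singular values capture the target energy.
    rank = int((cumulative < energy_threshold).sum().item()) + 1
    full_rank = min(weight.shape)
    if rank <= max_rank_fraction * full_rank:
        return "LRC", rank        # heavy-tailed spectrum: safe to compress
    return "N-LRC", full_rank     # flat spectrum: keep this matrix dense
```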
Why it matters?
This research is important because it helps make large language models more practical to use by reducing their resource requirements without sacrificing performance. By optimizing how these models are trained and deployed, WeLore could lead to faster and more efficient AI applications in various fields, making advanced technology more accessible.
Abstract
Modern Large Language Models (LLMs) are composed of matrices with billions of elements, making their storage and processing quite demanding in terms of computational resources and memory usage. Being significantly large, such matrices can often be expressed in low-rank format with potential to relax resource requirements. Unlike prior works which focus on developing novel matrix decomposition algorithms, in this work we first study the emergence of low-rank structures across matrices within different layers of LLMs and establish a consequential relationship between the gradient dynamics and emerging low-rank expressiveness of matrices. Our findings reveal that different layers exhibit varying levels of converged low-rank structure, necessitating a non-uniform rank reduction across them to minimize performance drop due to compression. In view of that, we present Weight Low-Rank Projection (WeLore) that unifies weight compression and memory-efficient fine-tuning as ONE, in a data-agnostic and one-shot way. WeLore capitalizes on the heavy-tail distribution of singular values to identify a suitable rank reduction ratio for matrices within LLMs. Going beyond being only a compression technique, WeLore categorizes weight matrices into Low-rank Components (LRCs) and Non-Low-rank Components (N-LRCs) based on their ability to express themselves as low-rank. Our gradient perspective and extensive experiments illustrate that LRCs tend to have better finetuning capabilities and can closely mimic (sometimes outperform) the training loss trajectory and performance of full-finetuning with notable memory and compute footprint reduction. For example, finetuning a 50% compressed LLaMa-2 7B model using only a fraction of parameters in LRCs (WeLore) can outperform its full finetuning with ~3x better throughput and ~0.6x GPU requirement. Our code is available at https://github.com/VITA-Group/welore
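As a rough illustration of the fine-tuning setup the abstract describes (trainable low-rank factors for LRCs, frozen dense weights for N-LRCs), here is a minimal PyTorch sketch. It is not the released WeLore code; the `LowRankLinear` class, the rank of 512, and the matrix sizes are illustrative assumptions, and the official implementation lives at https://github.com/VITA-Group/welore.

```python
import torch
import torch.nn as nn

class LowRankLinear(nn.Module):
    """Linear layer whose weight is stored as two trainable low-rank factors A @ B."""
    def __init__(self, linear: nn.Linear, rank: int):
        super().__init__()
        # Truncated SVD of the pretrained weight: W ≈ (U_r * S_r) @ Vh_r.
        U, S, Vh = torch.linalg.svd(linear.weight.data.float(), full_matrices=False)
        self.A = nn.Parameter(U[:, :rank] * S[:rank])  # (out_features, rank)
        self.B = nn.Parameter(Vh[:rank, :])            # (rank, in_features)
        self.bias = linear.bias

    def forward(self, x):
        out = x @ self.B.t() @ self.A.t()
        return out if self.bias is None else out + self.bias

# Toy usage: compress one projection to rank 512 and train only A and B, so just
# the low-rank factors (and their optimizer states) consume fine-tuning memory.
dense = nn.Linear(4096, 4096, bias=False)
compressed = LowRankLinear(dense, rank=512)
for name, param in compressed.named_parameters():
    param.requires_grad = name in ("A", "B")
trainable = sum(p.numel() for p in compressed.parameters() if p.requires_grad)
print(f"trainable params: {trainable} vs dense: {dense.weight.numel()}")
```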