Grass: Compute Efficient Low-Memory LLM Training with Structured Sparse Gradients
Aashiq Muhamed, Oscar Li, David Woodruff, Mona Diab, Virginia Smith
2024-06-26

Summary
This paper introduces Grass, a method that makes training large language models (LLMs) more memory- and compute-efficient by replacing full gradient updates with structured sparse ones. By shrinking the memory needed for gradients and optimizer states, Grass makes it possible to train larger models on less powerful hardware.
What's the problem?
Training large language models requires storing not just the model weights but also gradients and optimizer states, which quickly exceeds the memory of a single GPU. Existing memory-saving methods project gradients into a lower-dimensional subspace, but they typically rely on dense projection matrices, which still add noticeable memory, computation, and communication overhead. This makes it hard to train very large models efficiently on limited hardware.
What's the solution?
Grass replaces the dense projection with a sparse, structured one: instead of multiplying the gradient by a dense matrix, it keeps only a structured subset of the gradient, turning each update into a structured sparse update. Because the projection is sparse, Grass reduces not only the optimizer-state memory but also the gradient memory footprint, the computation needed to apply the projection, and the communication cost in multi-GPU training. In pretraining and finetuning experiments, Grass matched full-rank training and existing projection-based methods in quality while delivering higher throughput and fitting larger models on a single GPU.
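To make the idea concrete, here is a minimal, illustrative sketch (not the authors' code) of the core mechanism: keep optimizer state only for a small set of selected gradient rows. Names such as SparseRowProjector and the uniform row-sampling rule are assumptions for illustration; the paper studies several sampling and scaling choices, and the official implementation is at the GitHub link in the abstract.

```python
# Hedged sketch of structured sparse gradient projection for a single weight matrix.
import torch

class SparseRowProjector:
    """Structured sparse projection: keep k of the m gradient rows (illustrative)."""

    def __init__(self, num_rows: int, rank: int):
        self.num_rows, self.rank = num_rows, rank
        self.resample()

    def resample(self):
        # Re-draw the selected rows periodically during training.
        self.rows = torch.randperm(self.num_rows)[: self.rank]

    def project(self, grad: torch.Tensor) -> torch.Tensor:
        # Applying the sparse projection reduces to indexing k rows:
        # no dense matmul and no extra full-size gradient copy.
        return grad[self.rows]

    def project_back(self, low_rank_update: torch.Tensor, shape) -> torch.Tensor:
        # Scatter the k updated rows into an otherwise-zero full-size update.
        full = torch.zeros(shape, dtype=low_rank_update.dtype)
        full[self.rows] = low_rank_update
        return full


# Toy usage: Adam-style moments are stored at k x n instead of m x n.
m, n, k, lr = 4096, 4096, 256, 1e-3
weight = torch.randn(m, n)
proj = SparseRowProjector(m, k)
exp_avg = torch.zeros(k, n)       # first moment, k x n
exp_avg_sq = torch.zeros(k, n)    # second moment, k x n

grad = torch.randn(m, n)          # stand-in for the backward-pass gradient
r = proj.project(grad)
exp_avg.mul_(0.9).add_(r, alpha=0.1)
exp_avg_sq.mul_(0.999).addcmul_(r, r, value=0.001)
step = exp_avg / (exp_avg_sq.sqrt() + 1e-8)
weight -= lr * proj.project_back(step, weight.shape)
```

The point of the sketch is the memory accounting: the optimizer moments occupy k x n entries rather than m x n, and both projecting and projecting back are cheap row operations rather than dense matrix multiplications.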
Why it matters?
This research matters because it lowers the hardware barrier for training and finetuning large language models. By cutting memory, computation, and communication costs, Grass lets researchers and developers work with larger models on ordinary GPUs, without needing extremely powerful hardware.
Abstract
Large language model (LLM) training and finetuning are often bottlenecked by limited GPU memory. While existing projection-based optimization methods address this by projecting gradients into a lower-dimensional subspace to reduce optimizer state memory, they typically rely on dense projection matrices, which can introduce computational and memory overheads. In this work, we propose Grass (GRAdient Structured Sparsification), a novel approach that leverages sparse projections to transform gradients into structured sparse updates. This design not only significantly reduces memory usage for optimizer states but also minimizes gradient memory footprint, computation, and communication costs, leading to substantial throughput improvements. Extensive experiments on pretraining and finetuning tasks demonstrate that Grass achieves performance competitive with full-rank training and existing projection-based methods. Notably, Grass enables half-precision pretraining of a 13B-parameter LLaMA model on a single 40GB A100 GPU, a feat infeasible for previous methods, and yields up to a 2x throughput improvement on an 8-GPU system. Code can be found at https://github.com/aashiqmuhamed/GRASS.
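For readers who want the mechanics behind the abstract, the following is a rough sketch of a generic projection-based optimizer step of the kind described above. The notation (gradient G_t of size m x n, projection P of size m x k, learning rate eta) is ours, not the paper's, and it omits details such as scaling and the schedule for re-selecting the projection.

```latex
% Hedged sketch of one projection-based optimizer step (notation is ours).
% G_t \in \mathbb{R}^{m \times n} is the full gradient, P \in \mathbb{R}^{m \times k}, k \ll m.
\begin{align*}
  R_t     &= P^\top G_t
            && \text{project: store } k \times n \text{ instead of } m \times n \\
  S_t     &= \mathrm{OptStats}(S_{t-1}, R_t)
            && \text{optimizer states live in the } k \times n \text{ subspace} \\
  W_{t+1} &= W_t - \eta \, P \, \mathrm{OptUpdate}(S_t, R_t)
            && \text{map back; a sparse } P \text{ makes this a row scatter}
\end{align*}
```

With a dense P (as in prior projection-based methods), the first and last steps are dense matrix multiplications; Grass's contribution is choosing a structured sparse P so that both reduce to selecting and scattering rows.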