
Memory-Efficient LLM Training with Online Subspace Descent

Kaizhao Liang, Bo Liu, Lizhang Chen, Qiang Liu

2024-08-26


Summary

This paper presents a new method for training large language models (LLMs) more efficiently, focusing on reducing the memory needed for the optimizer's internal state while maintaining performance.

What's the problem?

Training LLMs demands a lot of memory and computing power. Existing memory-efficient methods cut down the optimizer's memory by working in a smaller, low-rank space, but they choose that space with an expensive calculation called singular value decomposition (SVD), and there has been no general guarantee that training still converges when that space keeps changing.

What's the solution?

The authors introduce a method called Online Subspace Descent, which changes how the projection matrix (the map that compresses gradients, the values needed to update the model, into a smaller space) is maintained during training. Instead of recomputing it with singular value decomposition (SVD), they update it with online principal component analysis (PCA), which is cheaper and more flexible. They also prove a convergence guarantee that holds for arbitrary update rules of the projection matrix. The result is faster training with less memory while still achieving good results.
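To make this concrete, here is a minimal NumPy sketch of the idea: the projection matrix is nudged toward the dominant directions of the current gradient with a cheap Oja-style online PCA step (no SVD), and the optimizer state is kept only in the small projected space. The shapes, step sizes, and the exact PCA update here are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def online_pca_step(P, G, lr=0.1):
    """Nudge the projection matrix P (m x r) toward the top-r left subspace
    of the gradient G (m x n) with one Oja-style step, then re-orthonormalize.
    This is a stand-in for the paper's online PCA update; the exact rule in
    the paper may differ."""
    P = P + lr * (G @ (G.T @ P))   # pull columns toward dominant gradient directions
    Q, _ = np.linalg.qr(P)         # keep columns orthonormal (no SVD needed)
    return Q

# Illustrative training loop with a momentum-style optimizer state that
# lives entirely in the r-dimensional subspace (r x n instead of m x n).
rng = np.random.default_rng(0)
m, n, r = 512, 256, 8                      # layer shape and subspace rank (made up)
W = rng.normal(size=(m, n)) * 0.02         # weight matrix
P = np.linalg.qr(rng.normal(size=(m, r)))[0]
m1 = np.zeros((r, n))                      # first moment, stored low-rank

for step in range(100):
    G = rng.normal(size=(m, n))            # stand-in for a real gradient
    P = online_pca_step(P, G)              # cheap projection update, no SVD
    g_low = P.T @ G                        # project gradient into the subspace
    m1 = 0.9 * m1 + 0.1 * g_low            # optimizer state stays r x n
    W -= 1e-3 * (P @ m1)                   # map the update back to full size
```

The memory saving comes from the optimizer state being r x n rather than m x n; when r is much smaller than m, moment buffers of Adam-style optimizers shrink accordingly.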

Why it matters?

This research is important because it helps make the training of large language models more efficient, which can lead to faster development and deployment of AI technologies. By reducing the resources needed for training, it opens up opportunities for more researchers and developers to work with these powerful models, ultimately advancing the field of artificial intelligence.

Abstract

Recently, a wide range of memory-efficient LLM training algorithms have gained substantial popularity. These methods leverage the low-rank structure of gradients to project optimizer states into a subspace using a projection matrix found by singular value decomposition (SVD). However, the convergence of these algorithms is highly dependent on the update rules of their projection matrix. In this work, we provide the first convergence guarantee for arbitrary update rules of the projection matrix. This guarantee is generally applicable to optimizers that can be analyzed with Hamiltonian Descent, including most common ones, such as LION and Adam. Inspired by our theoretical understanding, we propose Online Subspace Descent, a new family of subspace descent optimizers without SVD. Instead of updating the projection matrix with eigenvectors, Online Subspace Descent updates the projection matrix with online PCA. Online Subspace Descent is flexible and introduces only minimal overhead to training. We show that for the task of pretraining LLaMA models ranging from 60M to 7B parameters on the C4 dataset, Online Subspace Descent achieves lower perplexity and better downstream task performance than state-of-the-art low-rank training methods across different settings and narrows the gap with full-rank baselines.
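The update structure the abstract describes can be written schematically as follows. The notation here is ours, not necessarily the paper's: $W_t$ is a weight matrix, $G_t$ its gradient, $P_t \in \mathbb{R}^{m \times r}$ the projection matrix, $\phi_t$ the base optimizer's update map (e.g., Adam) applied in the $r$-dimensional subspace, and the online PCA rule shown is one plausible instantiation; the paper's exact objective and regularization may differ.

```latex
% Projected optimizer update: states are kept on the r x n projected gradient.
\[
  W_{t+1} = W_t - \eta\, P_t\, \phi_t\!\left(P_t^{\top} G_t\right)
\]
% SVD-based methods take P_t from the top-r left singular vectors of G_t:
\[
  G_t \approx U_r \Sigma_r V_r^{\top}, \qquad P_t = U_r
\]
% Online Subspace Descent instead takes one cheap descent step on the
% reconstruction error of the current gradient (online PCA, no SVD):
\[
  P_{t+1} = P_t - \alpha \,\nabla_{P} \left\lVert G_t - P P^{\top} G_t \right\rVert_F^2 \Big|_{P = P_t}
\]
```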