
Adam-mini: Use Fewer Learning Rates To Gain More

Yushun Zhang, Congliang Chen, Ziniu Li, Tian Ding, Chenwei Wu, Yinyu Ye, Zhi-Quan Luo, Ruoyu Sun

2024-06-27


Summary

This paper introduces Adam-mini, an optimizer that makes training large language models (LLMs) more memory-efficient by using far fewer learning rates than Adam, cutting the optimizer's memory footprint by 45% to 50% while matching or exceeding the performance of AdamW.

What's the problem?

Training large language models takes a lot of memory, and a sizable share of it goes to the optimizer itself. Adam keeps a separate adaptive learning rate (its second-moment estimate v) for every single parameter, so the optimizer state grows in proportion to the model and adds communication overhead between GPUs and CPUs that slows training down. This makes it difficult for researchers with limited resources to train large models effectively.
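To make the cost concrete, here is a rough back-of-the-envelope sketch; the 7B model size and fp32 optimizer states are illustrative assumptions, not figures from the paper:

```python
# Rough, illustrative estimate (not from the paper) of Adam's optimizer-state
# memory for a 7B-parameter model, assuming fp32 optimizer states.
num_params = 7e9
bytes_per_fp32 = 4

m_state = num_params * bytes_per_fp32  # first-moment estimate m (momentum)
v_state = num_params * bytes_per_fp32  # second-moment estimate v (the per-parameter learning rates)

print(f"Adam optimizer state:      ~{(m_state + v_state) / 1e9:.0f} GB")  # ~56 GB
# Adam-mini keeps m but collapses v to one scalar per parameter block,
# so the v term becomes negligible and the optimizer state roughly halves.
print(f"Adam-mini optimizer state: ~{m_state / 1e9:.0f} GB")              # ~28 GB
```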

What's the solution?

Adam-mini addresses this by drastically reducing the number of learning rates the optimizer keeps. The authors find that at least 90% of the per-parameter learning rates in Adam (the entries of its second-moment estimate v) can be removed without harming performance. They achieve this by partitioning the model's parameters into blocks, following a principle based on the Hessian structure, and assigning a single, well-chosen learning rate to each block. This cuts the optimizer's memory footprint by 45% to 50% and, by lowering communication overhead, also speeds up training. Tested on language models from 125M to 7B parameters across pre-training, supervised fine-tuning, and RLHF, Adam-mini performed as well as or better than AdamW while using less memory.
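Here is a minimal PyTorch-style sketch of the idea, assuming for simplicity that each parameter tensor is one block; the paper's actual partition follows the Hessian block structure and is more fine-grained, so this illustrates the mechanism rather than the authors' implementation:

```python
import torch

def adam_mini_step(params, grads, states, lr=1e-3, betas=(0.9, 0.999), eps=1e-8):
    """One block-wise update step: a single second-moment scalar per block.

    Simplification: each parameter tensor is treated as one block. The paper
    partitions parameters according to the model's Hessian block structure,
    which matters for matching Adam's performance.
    """
    beta1, beta2 = betas
    for p, g in zip(params, grads):
        st = states.setdefault(p, {"m": torch.zeros_like(p), "v": 0.0, "t": 0})
        st["t"] += 1
        # Per-coordinate first moment, exactly as in Adam.
        st["m"].mul_(beta1).add_(g, alpha=1 - beta1)
        # One second-moment scalar for the whole block (mean of g^2),
        # replacing Adam's per-coordinate v.
        st["v"] = beta2 * st["v"] + (1 - beta2) * g.pow(2).mean().item()
        # Bias-correct, then apply an Adam-style update in which every
        # coordinate of the block shares the same denominator.
        m_hat = st["m"] / (1 - beta1 ** st["t"])
        v_hat = st["v"] / (1 - beta2 ** st["t"])
        p.data.add_(m_hat, alpha=-lr / (v_hat ** 0.5 + eps))
```

Because v shrinks from one value per parameter to one value per block, the optimizer state is dominated by m alone, which is where the 45% to 50% memory saving comes from.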

Why it matters?

This research matters because it makes training large language models easier and cheaper. By reducing memory requirements and increasing throughput (for example, 49.6% higher throughput than AdamW when pre-training Llama2-7B on two A800-80GB GPUs), Adam-mini lets more researchers with modest hardware train and fine-tune advanced models, which can lead to further innovation in natural language processing and machine learning.

Abstract

We propose Adam-mini, an optimizer that achieves on-par or better performance than AdamW with 45% to 50% less memory footprint. Adam-mini reduces memory by cutting down the learning rate resources in Adam (i.e., 1/√v). We find that ≥ 90% of these learning rates in v could be harmlessly removed if we (1) carefully partition the parameters into blocks following our proposed principle on Hessian structure; (2) assign a single but good learning rate to each parameter block. We further find that, for each of these parameter blocks, there exists a single high-quality learning rate that can outperform Adam, provided that sufficient resources are available to search it out. We then provide one cost-effective way to find good learning rates and propose Adam-mini. Empirically, we verify that Adam-mini performs on par or better than AdamW on various language models sized from 125M to 7B for pre-training, supervised fine-tuning, and RLHF. The reduced memory footprint of Adam-mini also alleviates communication overheads among GPUs and CPUs, thereby increasing throughput. For instance, Adam-mini achieves 49.6% higher throughput than AdamW when pre-training Llama2-7B on 2× A800-80GB GPUs, which saves 33% wall-clock time for pre-training.
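For reference, a sketch of the two update rules in standard Adam notation; the block statistic is written as a mean over the coordinates of block b, consistent with the sketch above, and the precise definitions follow the paper:

```latex
% Adam: a per-coordinate second moment v_t, i.e., one adaptive learning rate per parameter
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t, \qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}, \qquad
\theta_{t+1} = \theta_t - \eta\, \frac{\hat{m}_t}{\sqrt{\hat{v}_t} + \epsilon}

% Adam-mini (sketch): a single scalar v_t^{(b)} shared by all coordinates in block b
v_t^{(b)} = \beta_2\, v_{t-1}^{(b)} + (1-\beta_2)\, \operatorname{mean}\big(g_t^{(b)} \odot g_t^{(b)}\big), \qquad
\theta_{t+1}^{(b)} = \theta_t^{(b)} - \eta\, \frac{\hat{m}_t^{(b)}}{\sqrt{\hat{v}_t^{(b)}} + \epsilon}
```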