
SPAM: Spike-Aware Adam with Momentum Reset for Stable LLM Training

Tianjin Huang, Ziquan Zhu, Gaojie Jin, Lu Liu, Zhangyang Wang, Shiwei Liu

2025-01-14


Summary

This paper introduces SPAM (Spike-Aware Adam with Momentum Reset), a new way to train large AI language models. It is designed to make training more stable and efficient by handling the sudden spikes in the learning process that can cause problems.

What's the problem?

When training big AI language models, the gradients that guide learning sometimes jump suddenly, producing 'spikes' that can be up to 1,000 times larger than normal. These spikes can derail training and make the model perform worse, wasting time and resources because the training often has to be rolled back to a checkpoint or restarted.

What's the solution?

The researchers created SPAM, a new optimizer for training AI. SPAM watches for these spikes and adjusts the training process when they happen: it periodically resets part of the optimizer's internal state (called momentum) and clips spiked gradients so the model doesn't overreact to them. They tested SPAM on several types of AI tasks and found that it works better than other popular training methods.
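The two mechanisms just described (spike-aware gradient clipping and periodic momentum reset) can be sketched roughly as a modified Adam step. This is an illustrative sketch, not the authors' released code: the threshold `theta`, the reset interval, and the exact clipping rule (rescaling spiked entries against the running second moment) are assumptions based on the paper's description.

```python
import numpy as np

def spam_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999),
              eps=1e-8, theta=5000.0, reset_interval=500):
    """One SPAM-style update (sketch).

    Two additions on top of a plain Adam step:
      * momentum reset: the first/second moments are zeroed every
        `reset_interval` steps so spike residue does not linger
      * spike-aware clipping: gradient entries whose square exceeds
        theta times the running second moment are treated as spikes
        and rescaled before entering the moment updates
    """
    if "step" not in state:
        state["step"] = 0
        state["m"] = np.zeros_like(param)  # first moment
        state["v"] = np.zeros_like(param)  # second moment

    state["step"] += 1
    # Momentum reset: periodically discard accumulated moments.
    if state["step"] % reset_interval == 0:
        state["m"][:] = 0.0
        state["v"][:] = 0.0

    # Spike-aware clipping: entries with grad^2 > theta * v are spikes;
    # rescale them to sqrt(theta * v), keeping the sign.
    v = state["v"]
    spike = (grad ** 2 > theta * v) & (v > 0)
    clipped = np.where(spike, np.sign(grad) * np.sqrt(theta * v), grad)

    # Standard Adam moment updates on the clipped gradient.
    b1, b2 = betas
    state["m"] = b1 * state["m"] + (1 - b1) * clipped
    state["v"] = b2 * state["v"] + (1 - b2) * clipped ** 2

    # Bias-corrected update (restarting the correction after each
    # reset is omitted here for brevity).
    t = state["step"]
    m_hat = state["m"] / (1 - b1 ** t)
    v_hat = state["v"] / (1 - b2 ** t)
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)
```

The key design point is that a spike is judged relative to each coordinate's own second-moment history rather than by a single global norm, so one exploding coordinate cannot drag the whole update down.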

Why it matters?

This matters because it could make training large AI models faster, cheaper, and more reliable. SPAM also uses memory more efficiently, which is important when working with huge models. By making the training process more stable, it could help produce better models without needing as much computing power, allowing more researchers to train advanced AI and potentially speeding up progress in the field.

Abstract

Large Language Models (LLMs) have demonstrated exceptional performance across diverse tasks, yet their training remains highly resource-intensive and susceptible to critical challenges such as training instability. A predominant source of this instability stems from gradient and loss spikes, which disrupt the learning process, often leading to costly interventions like checkpoint recovery and experiment restarts, further amplifying inefficiencies. This paper presents a comprehensive investigation into gradient spikes observed during LLM training, revealing their prevalence across multiple architectures and datasets. Our analysis shows that these spikes can be up to 1000 times larger than typical gradients, substantially deteriorating model performance. To address this issue, we propose Spike-Aware Adam with Momentum Reset (SPAM), a novel optimizer designed to counteract gradient spikes through momentum reset and spike-aware gradient clipping. Extensive experiments, including both pre-training and fine-tuning, demonstrate that SPAM consistently surpasses Adam and its variants across various tasks, including (1) LLM pre-training from 60M to 1B, (2) 4-bit LLM pre-training, (3) reinforcement learning, and (4) time series forecasting. Additionally, SPAM facilitates memory-efficient training by enabling sparse momentum, where only a subset of momentum terms are maintained and updated. When operating under memory constraints, SPAM outperforms state-of-the-art memory-efficient optimizers such as GaLore and Adam-Mini. Our work underscores the importance of mitigating gradient spikes in LLM training and introduces an effective optimization strategy that enhances both training stability and resource efficiency at scale. Code is available at https://github.com/TianjinYellow/SPAM-Optimizer.git
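The sparse-momentum idea mentioned in the abstract (keeping Adam's moment estimates for only a subset of the parameters) can be sketched as follows. The subset selection, the resampling schedule, and the plain-SGD fallback for unselected coordinates are illustrative assumptions for this sketch, not the paper's exact procedure.

```python
import numpy as np

def sparse_adam_step(param, grad, state, lr=1e-3, betas=(0.9, 0.999),
                     eps=1e-8, density=0.25, resample_every=500):
    """Sketch of memory-efficient sparse momentum.

    Adam moments are stored for only `density` of the coordinates,
    cutting optimizer memory proportionally. The subset is resampled
    periodically, and the moments are reset when that happens (which
    matches the momentum-reset idea). Unselected coordinates take a
    plain SGD step in this sketch.
    """
    n = param.size
    k = max(1, int(density * n))
    step = state.get("step", 0)
    if step % resample_every == 0:
        rng = np.random.default_rng(step)
        state["idx"] = rng.choice(n, size=k, replace=False)
        state["m"] = np.zeros(k)  # moments only for the chosen subset
        state["v"] = np.zeros(k)
        state["t"] = 0            # steps since the last reset
    state["step"] = step + 1
    state["t"] += 1

    b1, b2 = betas
    idx = state["idx"]
    g = grad[idx]
    state["m"] = b1 * state["m"] + (1 - b1) * g
    state["v"] = b2 * state["v"] + (1 - b2) * g * g
    t = state["t"]
    m_hat = state["m"] / (1 - b1 ** t)
    v_hat = state["v"] / (1 - b2 ** t)

    out = param - lr * grad  # plain SGD everywhere ...
    out[idx] = param[idx] - lr * m_hat / (np.sqrt(v_hat) + eps)  # ... Adam on the subset
    return out
```

With `density=0.25`, the optimizer state is a quarter of full Adam's, which is the kind of saving that lets SPAM compete with memory-efficient optimizers like GaLore and Adam-Mini under tight memory budgets.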