Stable-SPAM: How to Train in 4-Bit More Stably than 16-Bit Adam

Tianjin Huang, Haotian Hu, Zhenyu Zhang, Gaojie Jin, Xiang Li, Li Shen, Tianlong Chen, Lu Liu, Qingsong Wen, Zhangyang Wang, Shiwei Liu

2025-02-25

Summary

This paper introduces Stable-SPAM, a new way to train large AI language models using much less computer memory while keeping quality as good as, or even better than, traditional methods.

What's the problem?

Training big AI models uses a lot of computer memory, which is expensive. When researchers try to save memory by representing numbers with fewer bits (4-bit instead of 16-bit), the training process becomes unstable and the model doesn't learn properly. It's like trying to paint a detailed picture using only four colors instead of sixteen: it's much harder to get the details right.

What's the solution?

The researchers created Stable-SPAM, which acts like a smart art teacher for the AI: it helps the model learn well with just four colors by doing three main things. It adjusts how aggressively to clip extreme values based on what it has seen before, it keeps the overall gradient balanced by normalizing it against its history, and it occasionally starts fresh so that mistakes don't pile up. This allows the model to learn as well as or better than before, while using much less memory.

Why it matters?

This matters because it could make training advanced AI much cheaper and faster. It's like finding a way to teach students just as well in half the time and with fewer resources. This could let more people work on AI and speed up progress in technology that uses it, from smartphones to medical research.

Abstract

This paper comprehensively evaluates several recently proposed optimizers for 4-bit training, revealing that low-bit precision amplifies sensitivity to learning rates and often causes unstable gradient norms, leading to divergence at higher learning rates. Among these, SPAM, a recent optimizer featuring momentum reset and spike-aware gradient clipping, achieves the best performance across various bit levels, but struggles to stabilize gradient norms, requiring careful learning rate tuning. To address these limitations, we propose Stable-SPAM, which incorporates enhanced gradient normalization and clipping techniques. In particular, Stable-SPAM (1) adaptively updates the clipping threshold for spiked gradients by tracking their historical maxima; (2) normalizes the entire gradient matrix based on its historical l_2-norm statistics; and (3) inherits momentum reset from SPAM to periodically reset the first and second moments of Adam, mitigating the accumulation of spiked gradients. Extensive experiments show that Stable-SPAM effectively stabilizes gradient norms in 4-bit LLM training, delivering superior performance compared to Adam and SPAM. Notably, our 4-bit LLaMA-1B model trained with Stable-SPAM outperforms the BF16 LLaMA-1B trained with Adam by up to 2 perplexity. Furthermore, when both models are trained in 4-bit, Stable-SPAM achieves the same loss as Adam while requiring only about half the training steps. Code is available at https://github.com/TianjinYellow/StableSPAM.git.
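To make the three mechanisms in the abstract concrete, here is a minimal pure-Python sketch of a Stable-SPAM-style update step: (1) spike clipping against an adaptively tracked historical gradient maximum, (2) normalization of the whole gradient by historical l_2-norm statistics, and (3) periodic momentum reset before a standard Adam update. The hyperparameter names (`gamma1`, `gamma2`, `theta`, `reset_interval`) and their defaults, and the exact moving-average formulas, are assumptions for illustration, not the authors' implementation (see the linked repository for that).

```python
import math

def stable_spam_step(param, grad, state, lr=1e-3, beta1=0.9, beta2=0.999,
                     gamma1=0.7, gamma2=0.9, theta=0.999,
                     reset_interval=500, eps=1e-8):
    """One Stable-SPAM-style update on a flat list of floats.

    Hypothetical sketch based on the abstract; exact formulas and
    hyperparameters are assumptions.
    """
    t = state["step"] = state.get("step", 0) + 1

    # (1) Spike-aware clipping: track a moving historical max of |grad|
    # and clamp any entry that exceeds the (bias-corrected) threshold.
    g_max = max(abs(g) for g in grad)
    m_max = theta * state.get("m_max", 0.0) + (1 - theta) * g_max
    state["m_max"] = m_max
    m_max_hat = m_max / (1 - theta ** t)
    grad = [math.copysign(m_max_hat, g) if abs(g) > m_max_hat else g
            for g in grad]

    # (2) Normalize the entire gradient using moving averages of its
    # l2-norm and squared l2-norm (historical norm statistics).
    g_norm = math.sqrt(sum(g * g for g in grad))
    m_norm = gamma1 * state.get("m_norm", 0.0) + (1 - gamma1) * g_norm
    v_norm = gamma2 * state.get("v_norm", 0.0) + (1 - gamma2) * g_norm ** 2
    state["m_norm"], state["v_norm"] = m_norm, v_norm
    m_hat = m_norm / (1 - gamma1 ** t)
    v_hat = v_norm / (1 - gamma2 ** t)
    scale = (m_hat / (math.sqrt(v_hat) + eps)) / (g_norm + eps)
    grad = [g * scale for g in grad]

    # (3) Momentum reset inherited from SPAM: periodically zero Adam's
    # first and second moments so accumulated gradient spikes fade out.
    if t % reset_interval == 0:
        state["exp_avg"] = [0.0] * len(param)
        state["exp_avg_sq"] = [0.0] * len(param)

    # Standard Adam step using the processed gradient.
    m = state.setdefault("exp_avg", [0.0] * len(param))
    v = state.setdefault("exp_avg_sq", [0.0] * len(param))
    for i, g in enumerate(grad):
        m[i] = beta1 * m[i] + (1 - beta1) * g
        v[i] = beta2 * v[i] + (1 - beta2) * g * g
        m_i = m[i] / (1 - beta1 ** t)
        v_i = v[i] / (1 - beta2 ** t)
        param[i] -= lr * m_i / (math.sqrt(v_i) + eps)
    return param
```

In this sketch the clipping and normalization operate on gradient statistics rather than raw values, which is what lets the optimizer react smoothly to occasional spikes instead of diverging; the periodic reset then prevents any spike that slips through from lingering in Adam's moments.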