
MSign: An Optimizer Preventing Training Instability in Large Language Models via Stable Rank Restoration

Lianhai Ren, Yucheng Ding, Xiao Liu, Qianxiao Li, Peng Cheng, Yeyun Gong

2026-02-09

Summary

This paper investigates why large language models sometimes fail during pretraining, focusing on a problem called 'training instability', where the learning process suddenly goes haywire.

What's the problem?

When training very large language models, a common issue is that the gradients – the signals used to update the model's weights – can suddenly become extremely large, effectively crashing the training run or rendering it useless. The researchers found that two things happen before this explosion: the 'stable rank' of the weight matrices (a measure of how many directions a matrix meaningfully uses, defined as its squared Frobenius norm divided by its squared spectral norm) drops rapidly, and the Jacobians of adjacent layers become increasingly aligned, meaning the layers all start amplifying the same directions. They proved mathematically that when these two conditions hold together, gradient norms grow exponentially with network depth.
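
To make the stable rank signal concrete, here is a minimal NumPy sketch (the function and the example matrices are illustrative, not the paper's code):

```python
import numpy as np

def stable_rank(W: np.ndarray) -> float:
    """Stable rank = squared Frobenius norm / squared spectral norm.

    It equals min(W.shape) when all singular values are equal, and falls
    toward 1 as a single direction comes to dominate the spectrum.
    """
    s = np.linalg.svd(W, compute_uv=False)  # singular values, largest first
    return float(np.sum(s**2) / s[0]**2)

rng = np.random.default_rng(0)
healthy = rng.standard_normal((256, 256))       # broad spectrum
collapsed = np.outer(rng.standard_normal(256),
                     rng.standard_normal(256))  # exactly rank-1
print(stable_rank(healthy))    # large (around 64 for this matrix)
print(stable_rank(collapsed))  # exactly 1: the collapse signal
```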

What's the solution?

To fix this, the researchers developed a new optimizer called MSign. It periodically adjusts the model's weight matrices with a 'matrix sign operation', which restores their stable rank, keeps the layers from becoming too synchronized, and stops the gradients from exploding. It's like periodically hitting a reset button to keep training stable.
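
The summary does not pin down the exact operation. One standard matrix sign for rectangular weights is the polar factor U Vᵀ of the SVD, which sets every nonzero singular value to 1 and so fully restores stable rank; the sketch below assumes that form, and the blending coefficient and norm rescaling are illustrative guesses, not the paper's precise update rule:

```python
import numpy as np

def matrix_sign(W: np.ndarray) -> np.ndarray:
    """Polar (semi-orthogonal) factor of W: if W = U @ diag(s) @ Vt, return U @ Vt.

    Every nonzero singular value becomes 1, so the stable rank of the
    result is min(W.shape) -- fully restored.
    """
    U, _, Vt = np.linalg.svd(W, full_matrices=False)
    return U @ Vt

def periodic_msign_adjustment(W: np.ndarray, alpha: float = 0.1) -> np.ndarray:
    """HYPOTHETICAL periodic adjustment: blend W with its sign factor,
    rescaled to keep W's overall magnitude. The blend and rescaling are
    assumptions for illustration, not the paper's exact update."""
    S = matrix_sign(W)
    S *= np.linalg.norm(W) / np.linalg.norm(S)  # match Frobenius norms
    return (1 - alpha) * W + alpha * S
```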

Why does it matter?

This work is important because training large language models is incredibly expensive and time-consuming, and frequent training failures waste enormous resources. MSign offers a way to make training more reliable and efficient, allowing researchers to build even more powerful language models without constantly running into crashes, and it does so at a computational overhead of less than 7%.

Abstract

Training instability remains a critical challenge in large language model (LLM) pretraining, often manifesting as sudden gradient explosions that waste significant computational resources. We study training failures in a 5M-parameter NanoGPT model scaled via μP, identifying two key phenomena preceding collapse: (1) rapid decline in weight matrix stable rank (ratio of squared Frobenius norm to squared spectral norm), and (2) increasing alignment between adjacent layer Jacobians. We prove theoretically that these two conditions jointly cause exponential gradient norm growth with network depth. To break this instability mechanism, we propose MSign, a new optimizer that periodically applies matrix sign operations to restore stable rank. Experiments on models from 5M to 3B parameters demonstrate that MSign effectively prevents training failures with a computational overhead of less than 7.0%.
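
As a toy numerical illustration of the abstract's theoretical claim (the dimensions, the per-layer gain of 1.5, and the rank-1-plus-noise construction are all invented for this demo): when each layer's Jacobian amplifies the same direction with gain above 1, backpropagated gradient norms blow up exponentially with depth, while unaligned layers of similar scale leave them roughly unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 24

# Aligned, low-stable-rank Jacobians: every layer amplifies the SAME
# direction u by 1.5, so the gains compound coherently across depth.
u = rng.standard_normal(d)
u /= np.linalg.norm(u)
aligned = [1.5 * np.outer(u, u) + 0.1 * rng.standard_normal((d, d)) / np.sqrt(d)
           for _ in range(depth)]

# Unaligned Jacobians of comparable per-layer scale: amplification in one
# layer is not reinforced by the next.
unaligned = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(depth)]

def backprop_norm(layers):
    g = rng.standard_normal(d)  # gradient arriving at the top layer
    for J in layers:
        g = J.T @ g  # chain rule: multiply by each layer's Jacobian
    return np.linalg.norm(g)

print(backprop_norm(aligned))    # explodes, roughly 1.5**depth in scale
print(backprop_norm(unaligned))  # stays near its initial scale
```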