Why Low-Precision Transformer Training Fails: An Analysis on Flash Attention

Haiquan Qiu, Quanming Yao

2025-10-09

Summary

This paper investigates why training transformer models becomes unstable when using faster, lower-precision number formats. It identifies the root cause of a specific failure that occurs when a technique called 'flash attention' is combined with low-precision arithmetic, and offers a fix.

What's the problem?

When training transformers, researchers speed up the process and save memory by using lower-precision number formats instead of more accurate ones. However, training can suddenly fail: the model's loss explodes and its performance collapses. This failure specifically occurs when using 'flash attention', a technique that speeds up the attention computation at the heart of the transformer. The cause was not understood; the failures seemed random, and no one knew why lower precision triggered them.
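The kind of rounding bias involved can be seen in a toy example (my illustration, not the paper's setup): repeatedly adding a small value in float16 eventually rounds every update in the same direction, so the error is systematic rather than random noise that averages out.

```python
import numpy as np

# Toy illustration (not the paper's setup): accumulating many small
# values in float16. Once the gap between representable numbers around
# the accumulator exceeds twice the addend, every addition rounds back
# down to the old value, so the error is biased, not random.
vals = np.full(10_000, 0.1, dtype=np.float16)

acc16 = np.float16(0.0)
for v in vals:
    acc16 = np.float16(acc16 + v)   # rounds after every addition

acc32 = vals.astype(np.float32).sum()  # high-precision reference

print(acc16)  # stalls far below the true sum
print(acc32)  # close to 1000
```

The low-precision sum stops growing long before it reaches the true total, and the shortfall is all in one direction.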

What's the solution?

The researchers discovered that this instability is not random. It arises because the attention mechanism begins to produce very similar, low-rank representations of the data. Combined with the biased way low-precision arithmetic rounds numbers, this creates a cycle in which errors compound at every training step and corrupt the weight updates. To fix it, they made a small change to the flash attention computation that reduces the bias in these rounding errors. This simple adjustment stabilized training and prevented the loss from exploding.
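The paper's actual modification targets the internals of flash attention, but the general principle that tracking and compensating rounding error removes its bias can be sketched with classic Kahan (compensated) summation, shown here purely as an analogy:

```python
import numpy as np

def naive_sum_f16(vals):
    s = np.float16(0.0)
    for v in vals:
        s = np.float16(s + v)       # biased rounding at every step
    return s

def kahan_sum_f16(vals):
    # Compensated (Kahan) summation: carry each step's rounding error
    # forward so it is re-applied later instead of accumulating as bias.
    # This is a standard technique, not the paper's flash attention fix.
    s = np.float16(0.0)
    c = np.float16(0.0)             # running compensation term
    for v in vals:
        y = np.float16(v - c)
        t = np.float16(s + y)
        c = np.float16(np.float16(t - s) - y)
        s = t
    return s

vals = np.full(10_000, 0.1, dtype=np.float16)
print(naive_sum_f16(vals), kahan_sum_f16(vals))
```

With compensation, the same float16 arithmetic lands close to the true total, illustrating how a small change to where and how rounding happens can remove a systematic drift.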

Why it matters?

This work is important because it provides a fundamental understanding of why low-precision training can fail, specifically with flash attention. By identifying the cause (the combination of similar low-rank representations and biased rounding), the authors not only explain a long-standing problem but also provide a practical solution. This lets researchers keep using faster, more efficient training methods without sacrificing stability or model performance, ultimately making these powerful systems cheaper to build and deploy.

Abstract

The pursuit of computational efficiency has driven the adoption of low-precision formats for training transformer models. However, this progress is often hindered by notorious training instabilities. This paper provides the first mechanistic explanation for a long-standing and unresolved failure case where training with flash attention in low-precision settings leads to catastrophic loss explosions. Our in-depth analysis reveals that the failure is not a random artifact but caused by two intertwined phenomena: the emergence of similar low-rank representations within the attention mechanism and the compounding effect of biased rounding errors inherent in low-precision arithmetic. We demonstrate how these factors create a vicious cycle of error accumulation that corrupts weight updates, ultimately derailing the training dynamics. To validate our findings, we introduce a minimal modification to the flash attention that mitigates the bias in rounding errors. This simple change stabilizes the training process, confirming our analysis and offering a practical solution to this persistent problem.
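For context, flash attention evaluates softmax attention in a streaming fashion, rescaling partial sums whenever a new running maximum score appears; in low precision, each rescaling and exponential is a site where rounding error enters. A minimal single-query sketch of this standard online-softmax recurrence (my reconstruction, not the paper's kernel or its modification):

```python
import numpy as np

def flash_attention_rowwise(q, K, V):
    # Single-query sketch of flash attention's online softmax.
    # Real kernels process K/V in tiles on-chip; the recurrence is the same.
    m = -np.inf                      # running max of attention scores
    l = 0.0                          # running sum of exp(score - m)
    acc = np.zeros(V.shape[1])       # running weighted sum of values
    for k, v in zip(K, V):
        s = float(q @ k)             # attention score for this key
        m_new = max(m, s)
        # rescale previous partial results to the new running max
        corr = np.exp(m - m_new) if np.isfinite(m) else 0.0
        p = np.exp(s - m_new)
        l = l * corr + p
        acc = acc * corr + p * v
        m = m_new
    return acc / l                   # equals softmax(q @ K.T) @ V
```

In exact arithmetic this matches the ordinary softmax-then-weighted-sum computation; the paper's analysis concerns how the rounding inside such a recurrence becomes biased in low precision.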