SageAttention3: Microscaling FP4 Attention for Inference and An Exploration of 8-Bit Training

Jintao Zhang, Jia Wei, Pengle Zhang, Xiaoming Xu, Haofeng Huang, Haoxu Wang, Kai Jiang, Jun Zhu, Jianfei Chen

2025-05-21

Summary

This paper introduces SageAttention3, a new way to make AI models that use attention mechanisms run faster and use less computing power by performing their calculations with lower-precision numbers.

What's the problem?

Attention mechanisms, which help AI models focus on the important parts of their input, require a lot of computing power and memory, and they become slow as models and their inputs grow larger.

What's the solution?

To fix this, the researchers built an attention method that runs on the FP4 Tensor Cores found in the newest GPUs, representing numbers with only 4 bits and using small per-block ("microscaling") scale factors to preserve accuracy. They also explored an 8-bit attention method for training. Together, these make running models much faster, and point toward faster training as well, without losing much accuracy.
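The key idea behind microscaling is to quantize a tensor in small blocks, giving each block its own scale factor so that a few large values don't destroy the precision of the rest. Below is a minimal NumPy sketch of this idea, simulating the FP4 (E2M1) value grid in ordinary floating point; the block size, function names, and grid handling are illustrative assumptions, not SageAttention3's actual implementation.

```python
import numpy as np

# Magnitudes representable by the FP4 E2M1 format (sign stored separately).
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])

def quantize_fp4_microscaled(x, block_size=16):
    """Simulate FP4 quantization with one scale factor per block of values."""
    x = x.reshape(-1, block_size)
    # One scale per block: map the block's largest magnitude onto FP4's max (6.0).
    scales = np.abs(x).max(axis=1, keepdims=True) / FP4_GRID[-1]
    scales[scales == 0] = 1.0  # avoid division by zero for all-zero blocks
    scaled = x / scales
    # Snap each magnitude to the nearest representable FP4 value.
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]
    return q, scales

def dequantize(q, scales):
    return q * scales

rng = np.random.default_rng(0)
x = rng.standard_normal(64).astype(np.float32)
q, scales = quantize_fp4_microscaled(x)
err = np.abs(dequantize(q, scales).ravel() - x).mean()
print(f"mean abs quantization error: {err:.4f}")
```

Because each 16-value block gets its own scale, the tiny 4-bit grid is always stretched to fit that block's local range, which is what keeps accuracy acceptable despite using so few bits.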

Why it matters?

This matters because faster, lower-precision attention lets advanced AI models run on cheaper, less powerful hardware, making these technologies more accessible and practical for more people and companies.

Abstract

Efficiency enhancements for attention mechanisms, including leveraging FP4 Tensor Cores and developing an 8-bit attention method, improve inference and training performance.