GPAS: Accelerating Convergence of LLM Pretraining via Gradient-Preserving Activation Scaling
Tianhao Chen, Xin Xu, Zijing Liu, Pengxiang Li, Xinyuan Song, Ajay Kumar Jaiswal, Fan Zhang, Jishan Hu, Yang Wang, Hao Chen, Shizhe Diao, Shiwei Liu, Yu Li, Yin Lu, Can Yang
2025-06-30
Summary
This paper introduces Gradient-Preserving Activation Scaling (GPAS), a technique that speeds up the pretraining of large language models by fixing problems with how information flows through the model's layers.
What's the problem?
In Pre-LayerNorm Transformers, the activations (the numbers flowing through the network) can grow much larger in deeper layers than in earlier ones. This imbalance keeps the deeper layers from learning properly and slows down training overall.
What's the solution?
GPAS scales down the activations inside the model while leaving the gradients, the signals the model learns from, unchanged. This rebalances the layers so that all of them can learn effectively, allowing the model to train faster and perform better.
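The idea of scaling an activation in the forward pass without touching its gradient can be sketched with a stop-gradient trick: y = x + (s − 1)·stop_grad(x) evaluates to s·x in the forward pass, but its derivative with respect to x is 1. The toy forward-mode autodiff below is illustrative only; the class names, the `gpas_scale` helper, and the fixed gate value `s` are assumptions for this sketch, not the paper's actual implementation.

```python
from dataclasses import dataclass

@dataclass
class Dual:
    """Tiny forward-mode autodiff value: a number plus its derivative."""
    val: float
    grad: float  # derivative w.r.t. the original input

def stop_grad(x: Dual) -> Dual:
    """Treat x as a constant: same value, zero derivative."""
    return Dual(x.val, 0.0)

def gpas_scale(x: Dual, s: float) -> Dual:
    """Gradient-preserving scaling (illustrative sketch).

    y = x + (s - 1) * stop_grad(x)
      forward:  y.val = s * x.val          (activation scaled by s)
      backward: dy/dx = 1                  (stop_grad term adds no gradient)
    """
    sg = stop_grad(x)
    return Dual(x.val + (s - 1.0) * sg.val,
                x.grad + (s - 1.0) * sg.grad)

x = Dual(4.0, 1.0)        # seed derivative dx/dx = 1
y = gpas_scale(x, s=0.5)  # hypothetical gate value for the demo
print(y.val)   # 2.0 -> activation scaled down
print(y.grad)  # 1.0 -> gradient preserved
```

Because the stop-gradient term is a constant to the backward pass, the layer's output shrinks while the learning signal flows through undiminished, which is the property the summary above describes.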
Why does it matter?
This matters because faster and more efficient training saves time and computing resources, making it easier to develop bigger and smarter AI models that can do more complex tasks.
Abstract
Gradient-Preserving Activation Scaling (GPAS) mitigates activation variance issues in Pre-LayerNorm Transformers and enhances training dynamics across different architectures.