Stronger Normalization-Free Transformers

Mingzhi Chen, Taiming Lu, Jiachen Zhu, Mingjie Sun, Zhuang Liu

2025-12-12

Summary

This research explores alternatives to the normalization layers commonly used in deep learning and ultimately proposes a new point-wise function, Derf, that outperforms both standard normalization layers and the recent DyT function.

What's the problem?

Normalization layers are usually considered essential for training deep learning models, since they keep training stable. However, a recently proposed point-wise function called Dynamic Tanh (DyT) showed that it's possible to achieve comparable results *without* traditional normalization. The question this paper addresses is whether an even better function than DyT can be found to replace normalization layers and improve model performance.

What's the solution?

The researchers first studied which intrinsic properties of a point-wise function make it a good replacement for normalization. They then ran a large-scale search over many candidate function designs. Through this process, they identified Derf(x) = erf(αx + s), a function built on the error function erf, which consistently outperformed DyT, LayerNorm, and RMSNorm across tasks such as image recognition and generation, speech representation, and DNA sequence modeling. Their analysis suggests that Derf's advantage comes mainly from improved generalization to new data rather than from a greater ability to fit the training data. A minimal sketch of what such a layer could look like is shown below.
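To make the idea concrete, here is a minimal PyTorch sketch of a Derf layer following the form Derf(x) = erf(αx + s) from the paper. The per-channel parameter shapes, the initialization values, and the trailing affine transform are assumptions made for illustration; the paper's exact parameterization may differ.

```python
import torch
import torch.nn as nn


class Derf(nn.Module):
    """Point-wise replacement for a normalization layer: Derf(x) = erf(alpha * x + s).

    Per-channel parameters and the output affine (weight, bias) are assumptions made
    for this sketch; the paper only specifies the functional form erf(alpha * x + s).
    """

    def __init__(self, dim: int, alpha_init: float = 1.0):
        super().__init__()
        self.alpha = nn.Parameter(torch.full((dim,), alpha_init))  # input scale alpha
        self.shift = nn.Parameter(torch.zeros(dim))                # input shift s
        self.weight = nn.Parameter(torch.ones(dim))                # output scale (assumed)
        self.bias = nn.Parameter(torch.zeros(dim))                 # output bias (assumed)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # erf squashes extreme activations into (-1, 1), much like tanh in DyT,
        # without computing any per-batch or per-token statistics.
        return self.weight * torch.erf(self.alpha * x + self.shift) + self.bias
```

Like DyT, this layer is purely element-wise, so it avoids the mean/variance or RMS reductions that LayerNorm and RMSNorm require.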

Why it matters?

This work is important because it offers a simpler and more effective alternative to normalization layers, which can be computationally expensive and complex. Derf's strong performance and simplicity make it a practical choice for building powerful deep learning models, especially Transformer architectures, without relying on traditional normalization techniques.

Abstract

Although normalization layers have long been viewed as indispensable components of deep learning architectures, the recent introduction of Dynamic Tanh (DyT) has demonstrated that alternatives are possible. The point-wise function DyT constrains extreme values for stable convergence and reaches normalization-level performance; this work seeks further for function designs that can surpass it. We first study how the intrinsic properties of point-wise functions influence training and performance. Building on these findings, we conduct a large-scale search for a more effective function design. Through this exploration, we introduce Derf(x) = erf(αx + s), where erf(x) is the rescaled Gaussian cumulative distribution function, and identify it as the most performant design. Derf outperforms LayerNorm, RMSNorm, and DyT across a wide range of domains, including vision (image recognition and generation), speech representation, and DNA sequence modeling. Our findings suggest that the performance gains of Derf largely stem from its improved generalization rather than stronger fitting capacity. Its simplicity and stronger performance make Derf a practical choice for normalization-free Transformer architectures.
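As a usage sketch, a Derf layer (as defined in the earlier code block) can be dropped into a standard pre-norm Transformer block wherever LayerNorm or RMSNorm would normally appear. The block layout below is a generic illustration, not the paper's exact architecture.

```python
import torch
import torch.nn as nn


class NormFreeBlock(nn.Module):
    """Generic pre-norm Transformer block using Derf (from the sketch above)
    in place of LayerNorm/RMSNorm. Layer sizes and layout are illustrative."""

    def __init__(self, dim: int, num_heads: int, mlp_ratio: int = 4):
        super().__init__()
        self.derf1 = Derf(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.derf2 = Derf(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, mlp_ratio * dim),
            nn.GELU(),
            nn.Linear(mlp_ratio * dim, dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Self-attention sub-block with Derf applied before attention.
        h = self.derf1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # MLP sub-block with Derf applied before the feed-forward network.
        x = x + self.mlp(self.derf2(x))
        return x
```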