ReLU's Revival: On the Entropic Overload in Normalization-Free Large Language Models
Nandan Kumar Jha, Brandon Reagen
2024-10-15

Summary
This paper examines the choice of activation function in large language models (LLMs) that omit Layer Normalization, showing that the simpler ReLU activation outperforms the commonly used GELU in these normalization-free models.
What's the problem?
Large language models typically rely on Layer Normalization (LayerNorm) to stabilize training and improve performance. However, LayerNorm complicates mechanistic interpretability and adds computational and communication overhead, for example in private inference. The commonly used GELU activation, while smooth and effective in standard transformers, does not perform as well once LayerNorm is removed: the model suffers from entropic overload in its early layers, meaning attention heads spread their attention nearly uniformly and under-use their representational capacity. A rough way to quantify this is sketched below.
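As a rough illustration (this code is not from the paper, and the tensor shapes are assumptions), entropic overload can be made concrete by measuring how close each attention head's score distribution is to uniform: a normalized entropy near 1.0 means the head attends almost equally to every position and contributes little selectivity.

import torch

def attention_entropy(attn_probs: torch.Tensor) -> torch.Tensor:
    """attn_probs: (batch, heads, query_len, key_len); each row sums to 1.
    Returns per-head mean entropy, normalized by the maximum entropy log(key_len)."""
    eps = 1e-9
    ent = -(attn_probs * (attn_probs + eps).log()).sum(dim=-1)   # (batch, heads, query_len)
    max_ent = torch.log(torch.tensor(float(attn_probs.shape[-1])))
    return (ent / max_ent).mean(dim=(0, 2))                      # (heads,)

# Near-uniform attention gives normalized entropy close to 1.0, i.e. "overloaded" heads.
probs = torch.softmax(0.01 * torch.randn(2, 8, 16, 16), dim=-1)
print(attention_entropy(probs))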
What's the solution?
The authors propose using ReLU instead of GELU in normalization-free LLMs. Their experiments show that this swap improves perplexity (a measure of how well a model predicts a sample; lower is better) by 8.2% relative to the GELU baseline. They attribute the gain to ReLU's geometrical properties, specialization in input space and intra-class selectivity, which lead to better learning dynamics and information retention when LayerNorm is absent.
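A minimal sketch, under assumed PyTorch conventions, of the kind of normalization-free feed-forward block involved (the class name NormFreeMLP and the hyperparameters are illustrative, not the authors' code). The point it shows is that the activation between the two linear layers is the single design choice being compared, with no LayerNorm anywhere in the block.

import torch
import torch.nn as nn

class NormFreeMLP(nn.Module):
    """Feed-forward sub-block of a decoder layer with no LayerNorm."""
    def __init__(self, d_model: int, d_ff: int, activation: str = "relu"):
        super().__init__()
        self.fc1 = nn.Linear(d_model, d_ff)
        self.fc2 = nn.Linear(d_ff, d_model)
        self.act = nn.ReLU() if activation == "relu" else nn.GELU()

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Residual connection with no pre- or post-normalization.
        return x + self.fc2(self.act(self.fc1(x)))

mlp = NormFreeMLP(d_model=512, d_ff=2048, activation="relu")
print(mlp(torch.randn(1, 16, 512)).shape)   # torch.Size([1, 16, 512])

Switching activation="relu" to "gelu" is the one-line change whose effect on perplexity the paper measures.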
Why it matters?
This research matters because it challenges the conventional preference for GELU in transformer models, suggesting that ReLU is the better choice when LayerNorm is removed. Matching the activation function to the model's architecture in this way can lead to more efficient and effective AI systems, particularly in settings such as private inference where LayerNorm is costly.
Abstract
LayerNorm is a critical component in modern large language models (LLMs) for stabilizing training and ensuring smooth optimization. However, it introduces significant challenges in mechanistic interpretability, outlier feature suppression, faithful signal propagation, and computational and communication complexity of private inference. This work explores desirable activation functions in normalization-free decoder-only LLMs. Contrary to the conventional preference for the GELU in transformer-based models, our empirical findings demonstrate an opposite trend: ReLU significantly outperforms GELU in LayerNorm-free models, leading to an 8.2% perplexity improvement. We discover a key issue with GELU, where early layers experience entropic overload, leading to the under-utilization of the representational capacity of attention heads. This highlights that smoother activations like GELU are ill-suited for LayerNorm-free architectures, whereas ReLU's geometrical properties, specialization in input space and intra-class selectivity, lead to improved learning dynamics and better information retention in the absence of LayerNorm. This study offers key insights for optimizing transformer architectures where LayerNorm introduces significant challenges.
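As a toy illustration of the "specialization in input space" point (my own example, not the paper's analysis): ReLU outputs exact zeros for all negative inputs, so each unit is active only on a restricted region of input space, whereas GELU returns small but non-zero values almost everywhere.

import torch
import torch.nn.functional as F

x = torch.randn(100_000)
relu_zero_frac = (F.relu(x) == 0).float().mean().item()
gelu_zero_frac = (F.gelu(x) == 0).float().mean().item()
print(f"ReLU exact-zero fraction: {relu_zero_frac:.2f}")   # roughly 0.50 for standard-normal input
print(f"GELU exact-zero fraction: {gelu_zero_frac:.2f}")   # essentially 0.00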