The Curse of Depth in Large Language Models
Wenfang Sun, Xinyuan Song, Pengxiang Li, Lu Yin, Yefeng Zheng, Shiwei Liu
2025-02-11
Summary
This paper introduces the 'Curse of Depth,' a problem in large language models (LLMs) where the deeper layers of the model contribute far less than they should, and proposes a solution called LayerNorm Scaling to fix this issue.
What's the problem?
In many LLMs, the deeper layers contribute much less to learning and representation than the earlier layers. This happens because Pre-Layer Normalization (Pre-LN), a technique used to stabilize training, lets the output variance of the residual stream grow with depth. As a result, deep layers end up behaving almost like identity mappings and barely help during training, making the model less efficient.
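The variance growth described above can be illustrated with a toy numpy simulation. This is only a sketch, not the paper's analysis: the width, depth, and random-linear "blocks" are made-up stand-ins for real Transformer layers, and a real LayerNorm also has learned scale/shift parameters that are omitted here.

```python
import numpy as np

rng = np.random.default_rng(0)
d, depth = 64, 24  # hypothetical width and layer count (assumptions)

def layer_norm(x):
    # Normalize to zero mean, unit variance (learned scale/shift omitted).
    return (x - x.mean()) / x.std()

x = rng.normal(size=d)
variances = []
for _ in range(depth):
    w = rng.normal(size=(d, d)) / np.sqrt(d)  # random linear map as a stand-in for a block
    x = x + w @ layer_norm(x)                 # Pre-LN residual update: x += f(LN(x))
    variances.append(x.var())

# The residual-stream variance keeps growing with depth in this toy model.
print(f"after layer 1: {variances[0]:.2f}, after layer {depth}: {variances[-1]:.2f}")
```

Because each block adds a roughly unit-variance term to the residual stream, the variance increases monotonically with depth; the paper argues the growth in real Pre-LN LLMs is even faster (exponential), which is what pushes deep blocks toward identity behavior.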
What's the solution?
The researchers proposed LayerNorm Scaling, which scales the output of each layer normalization by the inverse square root of its layer depth. This simple change keeps the deeper layers from becoming ineffective and allows them to contribute more during training. They tested this on models ranging from 130 million to 1 billion parameters and found that it significantly improved performance during both pre-training and fine-tuning.
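The effect of this scaling can be sketched with the same kind of toy simulation, comparing the final residual-stream variance with and without the 1/sqrt(layer index) factor. Again, the dimensions, depth, and random-linear blocks are illustrative assumptions, not the paper's setup.

```python
import numpy as np

def layer_norm(x):
    # Normalize to zero mean, unit variance (learned scale/shift omitted).
    return (x - x.mean()) / x.std()

def final_variance(depth=24, d=64, scaled=True, seed=0):
    """Variance of the residual stream after `depth` toy Pre-LN blocks."""
    rng = np.random.default_rng(seed)
    x = rng.normal(size=d)
    for ell in range(1, depth + 1):              # 1-based layer index
        w = rng.normal(size=(d, d)) / np.sqrt(d)  # stand-in for a Transformer block
        h = layer_norm(x)
        if scaled:
            h = h / np.sqrt(ell)  # LayerNorm Scaling: shrink LN output by 1/sqrt(layer index)
        x = x + w @ h
    return x.var()

print(final_variance(scaled=False), final_variance(scaled=True))
```

With the scaling, each layer's contribution shrinks like 1/layer-index, so the cumulative variance grows only logarithmically with depth instead of linearly (or faster), leaving deep layers with a meaningful update to make.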
Why it matters?
This matters because it makes LLMs more efficient and effective by ensuring all layers contribute properly. Fixing this issue can lead to better-performing AI models without needing extra resources, which is important for advancing AI technology in a cost-effective way.
Abstract
In this paper, we introduce the Curse of Depth, a concept that highlights, explains, and addresses the recent observation in modern Large Language Models (LLMs) where nearly half of the layers are less effective than expected. We first confirm the wide existence of this phenomenon across the most popular families of LLMs such as Llama, Mistral, DeepSeek, and Qwen. Our analysis, theoretically and empirically, identifies that the underlying reason for the ineffectiveness of deep layers in LLMs is the widespread usage of Pre-Layer Normalization (Pre-LN). While Pre-LN stabilizes the training of Transformer LLMs, its output variance exponentially grows with the model depth, which undesirably causes the derivative of the deep Transformer blocks to be an identity matrix, and therefore barely contributes to the training. To resolve this training pitfall, we propose LayerNorm Scaling, which scales the variance of output of the layer normalization inversely by the square root of its depth. This simple modification mitigates the output variance explosion of deeper Transformer layers, improving their contribution. Our experimental results, spanning model sizes from 130M to 1B, demonstrate that LayerNorm Scaling significantly enhances LLM pre-training performance compared to Pre-LN. Moreover, this improvement seamlessly carries over to supervised fine-tuning. All these gains can be attributed to the fact that LayerNorm Scaling enables deeper layers to contribute more effectively during training.