Mix-LN: Unleashing the Power of Deeper Layers by Combining Pre-LN and Post-LN
Pengxiang Li, Lu Yin, Shiwei Liu
2024-12-19
Summary
This paper introduces Mix-LN, a normalization technique that combines Pre-Layer Normalization (Pre-LN) and Post-Layer Normalization (Post-LN) within a single model to improve the training of large language models (LLMs).
What's the problem?
In many large language models, such as GPT and LLaMA, the deeper layers contribute little to overall performance and can often be pruned with minimal impact. The authors attribute this to the widespread use of Pre-Layer Normalization (Pre-LN), which produces diminished gradient norms in deeper layers. As a result, those layers learn less effectively, leaving part of the model's capacity underused during training.
What's the solution?
To address this, the authors propose Mix-LN, which applies Post-Layer Normalization (Post-LN) to the earlier layers of the model and Pre-LN to the deeper layers. This combination maintains stronger, more uniform gradient norms across the network, so both shallow and deep layers contribute effectively during training. Experiments across a range of model sizes show that Mix-LN outperforms both Pre-LN and Post-LN alone, improving overall performance without increasing model size.
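The layer-assignment idea can be sketched in a few lines: early layers get Post-LN, the rest get Pre-LN, with the split controlled by a fraction of the total depth. This is an illustrative sketch, not the authors' implementation; the function name `ln_placement` and the default split fraction `alpha = 0.25` are assumptions made here for clarity (the actual fraction is a tunable hyperparameter of the method).

```python
def ln_placement(layer_idx: int, num_layers: int, alpha: float = 0.25) -> str:
    """Choose the normalization scheme for one transformer layer under Mix-LN.

    The first floor(alpha * num_layers) layers use Post-LN; the remaining,
    deeper layers use Pre-LN. `alpha` is an illustrative default here, not
    a value prescribed by the paper's summary above.
    """
    return "post-ln" if layer_idx < int(alpha * num_layers) else "pre-ln"


# For a hypothetical 12-layer model with alpha = 0.25, the first 3 layers
# would use Post-LN and the remaining 9 would use Pre-LN.
schedule = [ln_placement(i, 12) for i in range(12)]
```

A real model would consult this schedule when constructing each transformer block, wiring the residual connections accordingly.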
Why it matters?
This research matters because it improves how large language models are trained rather than how large they are. By restoring the effectiveness of deep layers, Mix-LN increases usable model capacity without adding parameters, which can benefit downstream applications such as natural language understanding and machine translation, ultimately leading to better AI systems.
Abstract
Large Language Models (LLMs) have achieved remarkable success, yet recent findings reveal that their deeper layers often contribute minimally and can be pruned without affecting overall performance. While some view this as an opportunity for model compression, we identify it as a training shortfall rooted in the widespread use of Pre-Layer Normalization (Pre-LN). We demonstrate that Pre-LN, commonly employed in models like GPT and LLaMA, leads to diminished gradient norms in its deeper layers, reducing their effectiveness. In contrast, Post-Layer Normalization (Post-LN) preserves larger gradient norms in deeper layers but suffers from vanishing gradients in earlier layers. To address this, we introduce Mix-LN, a novel normalization technique that combines the strengths of Pre-LN and Post-LN within the same model. Mix-LN applies Post-LN to the earlier layers and Pre-LN to the deeper layers, ensuring more uniform gradients across layers. This allows all parts of the network--both shallow and deep layers--to contribute effectively to training. Extensive experiments with various model sizes from 70M to 7B demonstrate that Mix-LN consistently outperforms both Pre-LN and Post-LN, promoting more balanced, healthier gradient norms throughout the network, and enhancing the overall quality of LLM pre-training. Furthermore, we demonstrate that models pre-trained with Mix-LN learn better compared to those using Pre-LN or Post-LN during supervised fine-tuning (SFT) and reinforcement learning from human feedback (RLHF), highlighting the critical importance of high-quality deep layers. By effectively addressing the inefficiencies of deep layers in current LLMs, Mix-LN unlocks their potential, enhancing model capacity without increasing model size. Our code is available at https://github.com/pixeli99/MixLN.
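The abstract contrasts the two normalization placements that Mix-LN combines. The difference is only where the layer norm sits relative to the residual connection, which can be shown with a minimal sketch. The helper names and the toy stand-ins for the sublayer and the norm below are assumptions for illustration, not the paper's code.

```python
def pre_ln_sublayer(x, sublayer, norm):
    # Pre-LN: normalize the input first, apply the sublayer,
    # then add the unnormalized residual.
    return x + sublayer(norm(x))


def post_ln_sublayer(x, sublayer, norm):
    # Post-LN: apply the sublayer, add the residual,
    # then normalize the combined result.
    return norm(x + sublayer(x))


# Toy stand-ins (scalars, not real layers) just to show the two orderings
# produce different computations on the same input:
toy_sublayer = lambda v: v + 1.0   # pretend attention/FFN
toy_norm = lambda v: v * 0.5       # pretend layer norm

pre_out = pre_ln_sublayer(2.0, toy_sublayer, toy_norm)    # 2.0 + (0.5*2.0 + 1.0) = 4.0
post_out = post_ln_sublayer(2.0, toy_sublayer, toy_norm)  # 0.5 * (2.0 + 3.0) = 2.5
```

In Pre-LN the residual path bypasses the norm entirely, which stabilizes training but, as the abstract argues, shrinks gradient norms in deep layers; Post-LN normalizes after the residual add, preserving deep-layer gradients at the cost of vanishing gradients early in the network.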