
Post-LayerNorm Is Back: Stable, Expressive, and Deep

Chen Chen, Lai Wei

2026-01-28


Summary

This paper looks at why simply making language models wider or feeding them longer context is running into diminishing returns, and proposes a way to build much deeper models that could be fundamentally more expressive.

What's the problem?

Currently, simply increasing the width of language models or the amount of text they can consider at once isn't leading to big improvements. A promising alternative is to make models deeper, meaning more layers, but the classic 'Post-LayerNorm' (Post-LN) Transformer layout becomes unstable and difficult to train as it gets very deep. The issue stems from how information flows through the layers: signals weaken as they pass back through each layer, gradients vanish, and the earliest layers stop learning effectively.
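
To make the contrast concrete, here is a minimal sketch (in PyTorch, not from the paper) of the two standard block layouts. The class names and the generic sublayer argument are illustrative; real Transformer blocks use attention and feed-forward sublayers.

import torch
import torch.nn as nn

class PostLNBlock(nn.Module):
    # Classic Post-LN layout: LayerNorm wraps the residual sum, so the
    # skip path back to earlier layers passes through every normalization.
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return self.norm(x + self.sublayer(x))

class PreLNBlock(nn.Module):
    # Pre-LN layout: the identity skip bypasses the norm, which is why
    # Pre-LN replaced Post-LN in most modern LLMs.
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)

    def forward(self, x):
        return x + self.sublayer(self.norm(x))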

What's the solution?

The researchers revisited the 'Post-LayerNorm' method, which had been abandoned because it was unstable at scale. They found that the problem wasn't Post-LayerNorm itself, but the standard 'ResNet'-style residual connection used to pass information between layers. They replaced that connection with a 'Highway'-style connection, which lets signal and gradient flow more easily through the network, even when it is very deep. The resulting architecture, called Keel, trains stably at more than 1000 layers without complicated optimization tricks or special initialization, as sketched in the code below.
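
Below is one possible reading of that change as code, again only a hedged sketch: the exact gate parameterization, initialization, and where the LayerNorm sits relative to the gated sum are assumptions here, not the paper's equations. The Highway idea is that a learned gate t(x) decides how much of the transformed signal versus the untouched input to carry forward.

import torch
import torch.nn as nn

class HighwayPostLNBlock(nn.Module):
    # Hypothetical Keel-style block: a Post-LN layer where the plain
    # ResNet sum (x + sublayer(x)) is replaced by a Highway-style gate.
    # Gate placement and initialization are guesses, not the paper's exact formulation.
    def __init__(self, d_model, sublayer):
        super().__init__()
        self.sublayer = sublayer
        self.norm = nn.LayerNorm(d_model)
        self.gate = nn.Linear(d_model, d_model)

    def forward(self, x):
        h = self.sublayer(x)
        t = torch.sigmoid(self.gate(x))          # transform gate t(x)
        return self.norm(t * h + (1.0 - t) * x)  # carry gate is 1 - t(x)

The intuition is that when the gate keeps t(x) small, the block can pass its input through almost unchanged, so useful gradient signal has a route down to even the lowest layers of a very deep stack.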

Why it matters?

This work matters because it shows that extremely deep language models can be trained reliably. By fixing the instability of Post-LayerNorm, it opens the door to architectures that are fundamentally more expressive than current ones and to depth as a practical new axis for scaling, potentially leading to a new generation of AI with improved performance and understanding.

Abstract

Large language model (LLM) scaling is hitting a wall. Widening models yields diminishing returns, and extending context length does not improve fundamental expressivity. In contrast, depth scaling offers theoretically superior expressivity, yet current Transformer architectures struggle to train reliably at extreme depths. We revisit the Post-LayerNorm (Post-LN) formulation, whose instability at scale caused its replacement by Pre-LN in modern LLMs. We show that the central failure mode of Post-LN arises from the ResNet-style residual pathway, which introduces gradient vanishing in deep networks. We present Keel, a Post-LN Transformer that replaces this residual path with a Highway-style connection. This modification preserves the gradient flow through the residual branch, preventing signal vanishing from the top layers to the bottom. Unlike prior methods, Keel enables stable training at extreme depths without requiring specialized initialization or complex optimization tricks. Keel trains robustly at depths exceeding 1000 layers and consistently improves perplexity and depth-scaling characteristics over Pre-LN. These findings indicate that Post-LN, when paired with a Highway-style connection, provides a simple and effective foundation for building deeply scalable LLMs, opening the possibility for future infinite-depth architectures.
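
As a rough way to probe the gradient-flow claim, the toy script below stacks many of the sketch blocks defined above and measures how much gradient reaches the input after a backward pass. It is an illustration of the kind of check one could run, not a reproduction of the paper's experiments; the numbers depend on depth, width, and initialization.

import torch
import torch.nn as nn

def bottom_gradient_norm(block_cls, depth=64, d_model=32):
    # Stack `depth` blocks, backprop a scalar loss, and report the gradient
    # norm at the input -- a crude proxy for trainability at depth.
    blocks = nn.Sequential(*[block_cls(d_model, nn.Linear(d_model, d_model))
                             for _ in range(depth)])
    x = torch.randn(8, d_model, requires_grad=True)
    blocks(x).pow(2).mean().backward()
    return x.grad.norm().item()

# Reuses PostLNBlock and HighwayPostLNBlock from the sketches above.
print("Post-LN (ResNet sum):", bottom_gradient_norm(PostLNBlock))
print("Post-LN (Highway)   :", bottom_gradient_norm(HighwayPostLNBlock))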