SPINAL -- Scaling-law and Preference Integration in Neural Alignment Layers
Arion Das, Partha Pratim Saha, Amit Dhanda, Vinija Jain, Aman Chadha, Amitava Das
2026-01-13
Summary
This paper investigates how large language models change internally when they're 'aligned' to better follow human preferences, specifically using a technique called Direct Preference Optimization (DPO). It aims to understand *where* and *how* these changes happen within the model's layers.
What's the problem?
When we try to make AI models behave the way we want (alignment), we often don't know exactly *how* the model is changing internally to achieve that. It's like adjusting a complex machine without seeing the gears – we know it works, but not *why*. This makes it hard to compare different aligned models, predict when they might fail, or even fully trust them. Existing methods don't give us a clear picture of the internal adjustments made during alignment.
What's the solution?
The researchers developed a tool called SPINAL, which essentially creates a 'depth trace' of the model. It looks at each layer of the model and measures two things: how much the information gets compressed (contraction score) and how much the model's understanding of text shifts between layers (transport score). They found that DPO primarily affects the very last layers of the model, making them more focused and stable. Aligned models show a clear pattern of increasing compression and smoother shifts in understanding in these final layers, while unaligned models are more chaotic.
Why does it matter?
Understanding the internal changes during alignment is crucial for building safer and more reliable AI. SPINAL provides a way to 'audit' these models, checking where the alignment is happening, how strong it is, and if it's becoming unstable during training. This allows developers to better compare models, diagnose problems, and ultimately build AI systems that are more aligned with human values and expectations.
Abstract
Direct Preference Optimization (DPO) is a principled, scalable alternative to RLHF for aligning large language models from pairwise preferences, but its internal geometric footprint remains undercharacterized, limiting audits, checkpoint comparisons, and failure prediction. We introduce SPINAL (Scaling-law and Preference Integration in Neural Alignment Layers), a diagnostic that measures how alignment reshapes representations across depth by tracing localized structural change layer by layer. Across model families, DPO produces a layerwise calibration effect concentrated in the final decoder blocks (often layers 21-30), where preference gradients most directly affect the next-token distribution. SPINAL encodes each checkpoint as a depth trace over (layer index, contraction score, transport score). The contraction score summarizes how quickly the tail of a layer's spectrum decays (how fast small modes vanish); higher values indicate stronger contraction into fewer effective directions. The transport score summarizes how much the token distribution shifts between adjacent layers using a bounded overlap measure; lower values indicate shorter, smoother steps through representation space. Aligned checkpoints show a late-layer ramp-up in contraction and a smooth reduction in transport, consistent with tightened and stabilized policy mass, while unaligned models trace higher-curvature, more entropic, and geometrically incoherent depth paths. Overall, alignment is geometrically localized: the final layers encode the dominant preference-induced corrections. SPINAL turns this localization into a practical audit signal, quantifying where alignment concentrates, how strongly it manifests, and when it begins to destabilize during training.
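The two depth-trace scores described in the abstract can be sketched concretely. The abstract does not give exact formulas, so the specific choices below are illustrative assumptions, not SPINAL's actual definitions: the contraction score is approximated as one minus the spectral tail mass of a layer's hidden states, and the transport score as one minus a Bhattacharyya overlap between adjacent layers' next-token distributions (a bounded overlap measure). Hidden states and distributions are synthetic stand-ins.

```python
import numpy as np

def contraction_score(H, tail_frac=0.5):
    """Illustrative contraction score: how quickly the tail of the
    layer's singular-value spectrum decays. Higher values mean the
    layer contracts into fewer effective directions."""
    s = np.linalg.svd(H, compute_uv=False)
    s = s / s.sum()                      # normalize spectral mass
    k = int(len(s) * (1 - tail_frac))    # start of the spectral tail
    tail_mass = s[k:].sum()              # mass carried by small modes
    return 1.0 - tail_mass               # in [0, 1]; 1 = no tail mass

def transport_score(p, q):
    """Illustrative transport score between adjacent layers' next-token
    distributions via the Bhattacharyya coefficient, a bounded overlap.
    Lower values mean a shorter, smoother step between layers."""
    overlap = np.sum(np.sqrt(p * q))     # in [0, 1]; 1 = identical
    return 1.0 - overlap

# Build a toy depth trace (layer index, contraction, transport)
# from synthetic per-layer hidden states and token distributions.
rng = np.random.default_rng(0)
layers = []
for _ in range(4):
    H = rng.normal(size=(32, 16))        # fake hidden states (tokens x dims)
    logits = rng.normal(size=100)
    p = np.exp(logits) / np.exp(logits).sum()
    layers.append((H, p))

trace = []
for l in range(1, len(layers)):
    H, p = layers[l]
    _, p_prev = layers[l - 1]
    trace.append((l, contraction_score(H), transport_score(p_prev, p)))
```

Under this sketch, an aligned checkpoint would show the contraction entries ramping up and the transport entries shrinking over the final layers of the trace, which is the late-layer pattern the paper reports.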