A Refined Analysis of Massive Activations in LLMs

Louis Owen, Nilabhra Roy Chowdhury, Abhay Kumar, Fabian Güra

2025-03-31

Summary

This paper studies massive activations, a small number of unusually large internal activation values in large language models (LLMs), and tests strategies for mitigating them.

What's the problem?

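Some activations inside LLMs grow far larger than the rest. These massive activations are relevant for low-precision training and quantization, but existing analyses are limited in scope, so it is unclear how well their conclusions generalize across architectures.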

What's the solution?

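The researchers analyzed massive activations across a broad range of LLMs, including both GLU-based and non-GLU-based architectures. They found that not all massive activations are harmful and that previously proposed fixes such as Attention KV bias are model-specific and fail in some cases. Hybrid strategies that pair Target Variance Rescaling (TVR) with Attention KV bias or Dynamic Tanh (DyT) reduced massive activations while preserving downstream performance in the scenarios they tested.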

Why does it matter?

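Because massive activations are relevant to low-precision training and quantization, knowing which of them are actually harmful, and which mitigations hold up across architectures, can help make LLMs more stable and efficient.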

Abstract

Motivated in part by their relevance for low-precision training and quantization, massive activations in large language models (LLMs) have recently emerged as a topic of interest. However, existing analyses are limited in scope, and generalizability across architectures is unclear. This paper helps address some of these gaps by conducting an analysis of massive activations across a broad range of LLMs, including both GLU-based and non-GLU-based architectures. Our findings challenge several prior assumptions, most importantly: (1) not all massive activations are detrimental, i.e. suppressing them does not lead to an explosion of perplexity or a collapse in downstream task performance; (2) proposed mitigation strategies such as Attention KV bias are model-specific and ineffective in certain cases. We consequently investigate novel hybrid mitigation strategies; in particular pairing Target Variance Rescaling (TVR) with Attention KV bias or Dynamic Tanh (DyT) successfully balances the mitigation of massive activations with preserved downstream model performance in the scenarios we investigated. Our code is available at: https://github.com/bluorion-com/refine_massive_activations.
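
To make the notion of a massive activation concrete, the following is a minimal sketch of how one might flag such values in a set of hidden states. It is not the authors' code: the function name `find_massive_activations` and both thresholds (an absolute magnitude cutoff and a ratio to the median magnitude) are illustrative assumptions, and the paper's own criterion may differ.

```python
from statistics import median


def find_massive_activations(hidden, abs_threshold=100.0, ratio_threshold=1000.0):
    """Return (token_idx, dim_idx, value) for entries flagged as massive.

    hidden: list of per-token lists of floats (one hidden state per token).
    Both thresholds are illustrative assumptions, not values from the paper.
    """
    # Median magnitude over all entries serves as the "typical" scale.
    magnitudes = [abs(v) for row in hidden for v in row]
    med = median(magnitudes)

    flagged = []
    for t, row in enumerate(hidden):
        for d, v in enumerate(row):
            # Flag only values that are large in absolute terms AND
            # far larger than the typical magnitude.
            if abs(v) > abs_threshold and abs(v) > ratio_threshold * med:
                flagged.append((t, d, v))
    return flagged


# Toy hidden states: 3 tokens, 4 dimensions, one planted outlier.
toy = [
    [0.1, -0.2, 0.05, 0.3],
    [0.2, 0.1, 2500.0, -0.1],
    [-0.3, 0.15, 0.1, 0.2],
]
flagged = find_massive_activations(toy)
print(flagged)
```

In practice the hidden states would come from a real model's intermediate layers rather than a toy list, and the thresholds would need to be chosen to match whatever definition the analysis uses.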