
L^2M: Mutual Information Scaling Law for Long-Context Language Modeling

Zhuo Chen, Oriol Mayné i Comas, Zhuotao Jin, Di Luo, Marin Soljačić

2025-03-07


Summary

This paper introduces a new way to understand how information is shared across long pieces of text, which helps improve AI language models that work with large amounts of text.

What's the problem?

Current AI language models struggle to understand and remember information from very long texts, like entire books or long conversations, and we don't fully understand why this happens or how to fix it.

What's the solution?

The researchers established a new mathematical rule called the 'bipartite mutual information scaling law', which describes how much information the two halves of a long text share with each other. They used this to formulate a guideline called the L^2M condition, which tells us how much memory an AI model needs in order to understand long texts properly. A minimal sketch of the intuition follows.
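To make the idea concrete, here is a minimal illustrative sketch (not the authors' code) of the intuition behind the L^2M condition: if the information shared between the two halves of a text grows as a power law in the text's length, then a model's memory (its latent state size) must grow at least as fast to capture it. The constant c, the exponent beta, and the state size of 256 below are illustrative assumptions, not values from the paper.

    # Hypothetical illustration of the L^2M condition: a model's latent
    # state must grow at least as fast as the bipartite mutual information
    # between the two halves of a length-L text.

    def bipartite_mi(L, c=1.0, beta=0.5):
        """Assumed power-law scaling I(L) = c * L**beta (illustrative)."""
        return c * L ** beta

    def satisfies_l2m(latent_state_size, L, c=1.0, beta=0.5):
        """Check whether a fixed memory can keep up with the information
        shared across a context of length L."""
        return latent_state_size >= bipartite_mi(L, c, beta)

    # A fixed-size memory eventually fails as the context grows:
    for L in [1_000, 10_000, 100_000, 1_000_000]:
        print(L, satisfies_l2m(latent_state_size=256, L=L))

Under these assumed numbers, a fixed 256-unit memory keeps up until roughly L = 65,000 tokens and then falls behind, which is the intuition for why an architecture whose state size does not grow with context length eventually hits a ceiling on long texts.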

Why it matters?

This matters because it helps us build better AI that can understand and work with longer pieces of text, like entire books or long conversations. This could lead to smarter AI assistants, better automatic summarization tools, and improved language translation for long documents.

Abstract

We rigorously establish a bipartite mutual information scaling law in natural language that governs long-range dependencies. This scaling law, which we show is distinct from and scales independently of the conventional two-point mutual information, is the key to understanding long-context language modeling. Using this scaling law, we formulate the Long-context Language Modeling (L^2M) condition, which relates a model's capacity for effective long context length modeling to the scaling of its latent state size for storing past information. Our results are validated through experiments on both transformers and state space models. This work establishes a theoretical foundation that guides the development of large language models toward longer context lengths.
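In notation, the abstract's two claims can be written roughly as follows; this is a plausible formalization based on the abstract alone, and the power-law form with exponent beta is an assumption rather than an equation quoted from the paper.

    % Bipartite mutual information between the two halves of a
    % length-L sequence, assumed to follow a power law:
    I_{\mathrm{bp}}(L) = I\left(X_{1:L/2};\, X_{L/2+1:L}\right) \propto L^{\beta}

    % L^2M condition (paraphrased): the latent state storing past
    % information must scale at least as fast as the bipartite MI:
    \dim\big(\text{latent state}(L)\big) \gtrsim I_{\mathrm{bp}}(L)

The key contrast drawn in the abstract is that this bipartite quantity, measured between two whole blocks of text, scales independently of the conventional two-point mutual information between individual tokens, and it is the bipartite version that constrains how much a model must remember.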