Forget BIT, It is All about TOKEN: Towards Semantic Information Theory for LLMs

Bo Bai

2025-11-05

Summary

This paper attempts to build a theoretical understanding of how large language models, or LLMs, actually work, moving beyond just observing their impressive abilities. It uses ideas from information theory – a field that deals with quantifying information – to explain the inner workings of these models.

What's the problem?

Currently, LLMs are largely 'black boxes'. We know they can generate text, translate languages, and answer questions, but we don't have a solid theoretical framework to explain *why* they're so good at these things. Developing and improving these models requires huge amounts of computing power and data, so understanding the fundamental principles behind them is crucial to make progress more efficiently and effectively.

What's the solution?

The researchers started from established concepts: rate-distortion theory, directed information, and Granger causality. They then adapted these ideas to operate on 'tokens' – the words or word fragments that LLMs actually process – rather than on raw 'bits' of data, yielding a new 'semantic information theory' tailored to LLMs. They developed mathematical measures of how information flows through a model during pre-training, post-training, and text generation, and also studied how best to represent tokens as vectors of numbers (embeddings). Finally, they proposed a general mathematical definition of autoregressive LLMs that applies across architectures such as Transformers, Mamba, and LLaDA.
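The token-level flavor of these measures can be illustrated with a toy sketch (the helper names and the tiny corpus below are illustrative, not from the paper): treating the token, not the bit, as the basic unit, we estimate the empirical entropy of a token stream and how much of it is explained by the immediately preceding token. The drop between the two is a crude first-order stand-in for the directed, causal information flow the paper formalizes over full token histories.

```python
import math
from collections import Counter

def token_entropy(tokens):
    """Empirical Shannon entropy H(X) of a token stream, in bits per token."""
    counts = Counter(tokens)
    n = len(tokens)
    return -sum(c / n * math.log2(c / n) for c in counts.values())

def conditional_entropy(tokens):
    """H(X_t | X_{t-1}) under empirical bigram statistics."""
    pairs = Counter(zip(tokens, tokens[1:]))
    prev = Counter(tokens[:-1])
    n = len(tokens) - 1
    return -sum(c / n * math.log2(c / prev[a]) for (a, _), c in pairs.items())

tokens = "the cat sat on the mat so the cat sat".split()
h = token_entropy(tokens)
h_cond = conditional_entropy(tokens)
# h - h_cond is the per-token information carried by one token of context;
# the paper's directed information extends this to the entire causal history.
print(f"H(X) = {h:.3f} bits/token, H(X|prev) = {h_cond:.3f} bits/token")
```

This first-order estimate only looks one token back; the directed information used in the paper sums conditional mutual information terms over the whole preceding sequence, which is what makes it a natural fit for next-token prediction.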

Why it matters?

This work is important because it provides a theoretical foundation for understanding LLMs. Having this framework allows researchers to analyze, improve, and potentially design even better language models. It moves the field beyond simply trial and error, offering tools to predict performance, understand limitations, and guide future development in a more informed way.

Abstract

Large language models (LLMs) have demonstrated remarkable capabilities in numerous real-world applications. While the vast majority of research, conducted from an experimental perspective, is progressing rapidly, it demands substantial computational power, data, and other resources. Therefore, how to open the black box of LLMs from a theoretical standpoint has become a critical challenge. This paper takes the theory of the rate-distortion function, directed information, and Granger causality as its starting point to investigate the information-theoretic principles behind LLMs, leading to the development of a semantic information theory for LLMs in which the fundamental unit is the token, rather than the bit, which lacks any semantic meaning. By defining the probabilistic model of LLMs, we discuss structure-agnostic information-theoretic measures, such as the directed rate-distortion function in pre-training, the directed rate-reward function in post-training, and the semantic information flow in the inference phase. The paper also delves into the theory of token-level semantic embedding and the information-theoretically optimal vectorization method. Thereafter, we propose a general definition of autoregressive LLMs, from which the Transformer architecture and its performance measures, such as the ELBO, generalization error bound, memory capacity, and semantic information measures, can be derived theoretically. Other architectures, such as Mamba/Mamba2 and LLaDA, are also discussed within our framework. Consequently, this paper provides a theoretical framework for understanding LLMs from the perspective of semantic information theory, offering the necessary theoretical tools for further in-depth research.
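For readers unfamiliar with the abstract's starting point: directed information is a standard quantity due to Massey, written here in the usual notation (our rendering, not the paper's):

```latex
% Massey's directed information from a sequence X^n to a sequence Y^n:
I(X^n \to Y^n) \;=\; \sum_{i=1}^{n} I(X^i ; Y_i \mid Y^{i-1})
% Unlike the symmetric mutual information I(X^n ; Y^n), each term conditions
% only on the past Y^{i-1}, so the measure respects causal ordering --
% the reason it suits autoregressive next-token prediction.
```

The paper's directed rate-distortion and rate-reward functions replace the mutual information in the classical rate-distortion optimization with this causal quantity, evaluated over token sequences.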