Physics in Next-token Prediction
Hongjun An, Yiliang Song, Xuelong Li
2024-11-04

Summary
This paper explores the physics behind Next-token Prediction (NTP), the procedure language models use to predict the next token in a sequence. It proposes new principles describing how information is conserved as a model trains and how the energy cost of training relates to the information the model acquires.
What's the problem?
It has been difficult to explain how language models work and why intelligent behavior emerges from them. In particular, there has been no clear account of the fundamental process underlying Next-token Prediction: how information is transferred from data into a model, and what energy cost that transfer entails.
What's the solution?
The authors show that NTP obeys a conservation law for information, which they call the First Law of Information Capacity (IC-1): the intelligence that emerges in an auto-regressive model is essentially information transferred from the training data into the model. They then apply Landauer's Principle, which sets a minimum energy cost for information processing, to derive the Second Law of Information Capacity (IC-2), linking the training of auto-regressive models to energy consumption. Together, these laws describe how information is handled in language models and underscore the importance of efficient training. The authors also derive several corollaries with practical implications for production training practice.
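For reference, Landauer's Principle is standard thermodynamics and is not specific to this paper: irreversibly erasing or recording one bit of information at temperature T dissipates at least k_B T ln 2 of energy. The line below states only this textbook bound, which IC-2 connects to auto-regressive training; the exact forms of IC-1 and IC-2 are given in the paper itself.

$$
E_{\min} = k_B T \ln 2 \approx (1.380649\times10^{-23}\,\mathrm{J/K})\,(300\,\mathrm{K})\,(\ln 2) \approx 2.87\times10^{-21}\ \mathrm{J\ per\ bit}
$$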
Why it matters?
This research matters because it offers a deeper, physics-grounded understanding of how language models generate text. By linking information transfer to physical principles, it gives researchers and developers a framework for optimizing model training and reducing energy consumption, making AI systems more efficient and sustainable.
Abstract
We discovered the underlying physics in Next-token Prediction (NTP). We identified the law of information conservation within NTP and proposed the First Law of Information Capacity (IC-1), demonstrating that the essence of intelligence emergence in auto-regressive models is fundamentally a process of information transfer. We also introduced Landauer's Principle into NTP, formulating the Second Law of Information Capacity (IC-2), which establishes the relationship between auto-regressive model training and energy consumption. Additionally, we presented several corollaries, which hold practical significance for production practices. Finally, we validated the compatibility and complementarity of our findings with existing theories.
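As a rough illustration of how such an energy bound can be used, the sketch below computes the Landauer lower bound for recording a given amount of information at a given temperature. It is a minimal sketch under assumed numbers (the 10^12-bit figure is hypothetical), not the paper's IC-2 formulation or its corollaries.

```python
# Illustrative sketch (not the paper's IC-2): the Landauer lower bound on the
# energy needed to irreversibly record a given amount of information.
import math

K_B = 1.380649e-23  # Boltzmann constant in J/K (exact by the 2019 SI definition)

def landauer_bound_joules(bits: float, temperature_k: float = 300.0) -> float:
    """Minimum energy (J) to erase/record `bits` of information at `temperature_k`."""
    return bits * K_B * temperature_k * math.log(2)

if __name__ == "__main__":
    # Hypothetical example: suppose training transfers 1e12 bits of information
    # into a model's parameters (an assumed figure, not taken from the paper).
    bits_transferred = 1e12
    e_min = landauer_bound_joules(bits_transferred)
    print(f"Landauer lower bound: {e_min:.3e} J for {bits_transferred:.0e} bits")
    # Prints roughly 2.9e-09 J -- a thermodynamic floor many orders of magnitude
    # below the energy that real GPU training runs consume today.
```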