Continuous Autoregressive Language Models
Chenze Shao, Darren Li, Fandong Meng, Jie Zhou
2025-11-03
Summary
This paper introduces a new way to build large language models (LLMs) that focuses on making each step of generating text more efficient, ultimately aiming for faster and less computationally expensive models.
What's the problem?
Current LLMs generate text one piece at a time, like predicting the next word in a sentence. Each prediction is a separate step that requires a full pass through the model, so the longer the output, the more sequential steps and compute are needed. In short, generation speed is limited by this one-token-per-step bottleneck.
What's the solution?
The researchers developed a system called CALM, which stands for Continuous Autoregressive Language Models. Instead of predicting the next single word (or 'token'), CALM predicts a continuous 'vector' that represents a whole chunk of K tokens at once. Think of it like summarizing a paragraph into a single idea instead of writing it word by word. They use a tool called an autoencoder to compress each chunk of K tokens into one vector and then reconstruct the original tokens from it with over 99.9% accuracy. Because each step now produces K tokens instead of one, the number of generation steps drops by a factor of K, making the process much faster.
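The chunk-to-vector round trip can be sketched with a toy, frozen embedding table. This is an illustration of the interface only, not the paper's trained autoencoder: the names (`encode_chunk`, `decode_vector`) and all dimensions are made up for this sketch, and no actual compression or learning happens here.

```python
import numpy as np

# Toy sketch (NOT the paper's model): map a chunk of K discrete tokens
# to one continuous vector, then reconstruct the tokens exactly.
rng = np.random.default_rng(0)
VOCAB, K, DIM = 50, 4, 8             # vocab size, chunk length, per-token dim
emb = rng.normal(size=(VOCAB, DIM))  # random fixed per-token embeddings

def encode_chunk(tokens):
    """Concatenate the K token embeddings into a single K*DIM vector."""
    return emb[tokens].reshape(-1)

def decode_vector(vec):
    """Recover each token by nearest-neighbor lookup in the embedding table."""
    parts = vec.reshape(K, DIM)
    dists = ((parts[:, None, :] - emb[None, :, :]) ** 2).sum(-1)
    return dists.argmin(axis=1)

chunk = np.array([3, 17, 42, 8])
vec = encode_chunk(chunk)            # one continuous vector stands in for K tokens
assert (decode_vector(vec) == chunk).all()
```

A real autoencoder would be trained and would use a latent smaller than K×DIM; here the latent is just the concatenation, which is enough to show the key point: the model only has to predict one vector per K tokens.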
Why it matters?
This research is important because it shows a promising new direction for building LLMs. By focusing on increasing the amount of information processed in each step, CALM achieves similar performance to existing models but with significantly less computing power. This could lead to the development of ultra-efficient language models that are more accessible and environmentally friendly, paving the way for more powerful AI without requiring massive amounts of energy.
Abstract
The efficiency of large language models (LLMs) is fundamentally limited by their sequential, token-by-token generation process. We argue that overcoming this bottleneck requires a new design axis for LLM scaling: increasing the semantic bandwidth of each generative step. To this end, we introduce Continuous Autoregressive Language Models (CALM), a paradigm shift from discrete next-token prediction to continuous next-vector prediction. CALM uses a high-fidelity autoencoder to compress a chunk of K tokens into a single continuous vector, from which the original tokens can be reconstructed with over 99.9% accuracy. This allows us to model language as a sequence of continuous vectors instead of discrete tokens, which reduces the number of generative steps by a factor of K. The paradigm shift necessitates a new modeling toolkit; therefore, we develop a comprehensive likelihood-free framework that enables robust training, evaluation, and controllable sampling in the continuous domain. Experiments show that CALM significantly improves the performance-compute trade-off, achieving the performance of strong discrete baselines at a significantly lower computational cost. More importantly, these findings establish next-vector prediction as a powerful and scalable pathway towards ultra-efficient language models. Code: https://github.com/shaochenze/calm. Project: https://shaochenze.github.io/blog/2025/CALM.
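The K-fold reduction in generative steps claimed above is simple arithmetic; a minimal sketch (the sequence length and the choice K=4 are illustrative, not results from the paper):

```python
import math

def generation_steps(num_tokens: int, k: int = 1) -> int:
    """Autoregressive steps needed: one step per chunk of k tokens."""
    return math.ceil(num_tokens / k)

tokens = 1024
baseline = generation_steps(tokens, k=1)  # token-by-token: 1024 steps
calm = generation_steps(tokens, k=4)      # next-vector with K=4: 256 steps
assert baseline == calm * 4
```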