
Predicting the Order of Upcoming Tokens Improves Language Modeling

Zayd M. K. Zuhri, Erland Hilman Fuadi, Alham Fikri Aji

2025-08-28


Summary

This paper investigates how to improve the way language models learn to predict text, focusing on an extra training signal that goes beyond simply guessing the next word.

What's the problem?

Language models are usually trained to predict the next word in a sentence. A previous technique called Multi-Token Prediction (MTP) added an extra objective of predicting several exact words ahead at once, but it didn't consistently help and often performed worse on standard benchmarks. The researchers argue that predicting the *exact* future words is simply too difficult to be useful as an extra learning task.

What's the solution?

The researchers propose a new method called Token Order Prediction (TOP). Instead of trying to guess the exact words coming up, it trains the model, alongside normal next-word prediction, to rank the upcoming words by how soon they will appear: which of the next few words comes first, which comes second, and so on. TOP is also cheaper to add than Multi-Token Prediction, requiring only a single extra output layer rather than several extra transformer layers.
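To make the idea concrete, here is a minimal, hypothetical sketch (in PyTorch) of what a TOP-style auxiliary objective could look like: upcoming tokens within a small window get higher target scores the sooner they appear, and a single extra unembedding layer is trained with a listwise softmax ranking loss to reproduce that ordering. The function names, window size, scoring rule, and the specific ranking loss are illustrative assumptions, not the authors' code; the exact formulation is in the linked repository.

```python
# Hypothetical sketch of a TOP-style auxiliary objective (not the authors' exact code).
# Assumptions: the next `window` tokens define relevance scores (sooner = higher),
# a single extra unembedding layer produces ranking logits, and a listwise
# softmax cross-entropy serves as the learning-to-rank loss.
import torch
import torch.nn.functional as F


def top_targets(input_ids: torch.Tensor, vocab_size: int, window: int = 4) -> torch.Tensor:
    """For each position, score every vocabulary id by how soon it appears among
    the next `window` tokens; ids that don't appear in the window score zero."""
    batch, seq_len = input_ids.shape
    targets = torch.zeros(batch, seq_len, vocab_size, device=input_ids.device)
    for offset in range(1, window + 1):
        score = float(window - offset + 1)              # closer tokens rank higher
        future = input_ids[:, offset:].unsqueeze(-1)    # ids appearing `offset` steps ahead
        src = torch.full_like(future, score, dtype=targets.dtype)
        # keep the highest score if the same id appears more than once in the window
        targets[:, : seq_len - offset].scatter_reduce_(-1, future, src, reduce="amax")
    return targets


def top_loss(hidden: torch.Tensor, top_head: torch.nn.Linear, targets: torch.Tensor) -> torch.Tensor:
    """Listwise ranking loss: push the softmax over the ranking logits toward the
    normalized proximity scores (ids outside the window get zero probability)."""
    logits = top_head(hidden)                           # single extra unembedding layer
    dist = targets / targets.sum(dim=-1, keepdim=True).clamp(min=1e-8)
    return F.cross_entropy(logits.flatten(0, 1), dist.flatten(0, 1))
```

In training, a loss like this would be added, with some weight, to the ordinary next-word prediction loss while sharing the same hidden states; how the paper weights and normalizes the two objectives is not shown here.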

Why it matters?

This research shows that teaching language models about the *order* of upcoming words is a valuable way to improve their overall performance. Across eight standard benchmarks, Token Order Prediction overall outperformed both standard next-word prediction and Multi-Token Prediction, even at larger model sizes (up to 7B parameters), suggesting it's a useful technique for building better language AI.

Abstract

Multi-Token Prediction (MTP) has been proposed as an auxiliary objective to improve next-token prediction (NTP) in language model training but shows inconsistent improvements, underperforming in standard NLP benchmarks. We argue that MTP's exact future token prediction is too difficult as an auxiliary loss. Instead, we propose Token Order Prediction (TOP), which trains models to order upcoming tokens by their proximity using a learning-to-rank loss. TOP requires only a single additional unembedding layer compared to MTP's multiple transformer layers. We pretrain models of 340M, 1.8B, and 7B parameters using NTP, MTP, and TOP objectives. Results on eight standard NLP benchmarks show that TOP overall outperforms both NTP and MTP even at scale. Our code is available at https://github.com/zaydzuhri/token-order-prediction
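As a rough illustration of the overhead difference mentioned in the abstract, the sketch below (hypothetical, not from the paper) contrasts the extra parameters each auxiliary objective adds on top of a shared transformer trunk: MTP-style setups typically add extra transformer layers plus output heads, while TOP adds only one extra unembedding matrix. The layer sizes and the number of MTP heads are made-up values for illustration.

```python
# Hypothetical comparison of auxiliary-head overhead (illustrative sizes only).
import torch.nn as nn

d_model, vocab_size = 2048, 32000

# MTP-style: extra transformer layers plus unembedding heads, one per future offset
# (exact designs vary across MTP papers; 3 extra offsets is an assumption here).
mtp_extra = nn.ModuleList(
    nn.Sequential(
        nn.TransformerEncoderLayer(d_model, nhead=16, batch_first=True),
        nn.Linear(d_model, vocab_size, bias=False),
    )
    for _ in range(3)
)

# TOP: a single additional unembedding layer used for the ranking objective.
top_extra = nn.Linear(d_model, vocab_size, bias=False)

def n_params(module: nn.Module) -> int:
    return sum(p.numel() for p in module.parameters())

print(f"MTP extra params: {n_params(mtp_extra):,}")
print(f"TOP extra params: {n_params(top_extra):,}")
```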