Diversity or Precision? A Deep Dive into Next Token Prediction
Haoyuan Wu, Hai Wang, Jiajia Wu, Jinxiang Ou, Keyao Wang, Weile Chen, Zihao Zheng, Bei Yu
2026-01-05
Summary
This paper investigates how the initial (pre-RL) training of large language models affects their ability to improve with reinforcement learning, focusing on the probability distribution the model uses to choose its next word.
What's the problem?
Large language models are getting better at reasoning when trained with reinforcement learning, but this improvement heavily relies on how the model was trained *before* the reinforcement learning stage. The standard way of training these models, using something called cross-entropy loss, might not be setting them up for the best possible learning experience with reinforcement learning. It's unclear if a more diverse or a more focused initial training approach is better for later improvement.
What's the solution?
The researchers propose a new way to initially train the language model by framing next-word prediction as a decision-making process. Instead of plain cross-entropy, they train with a shaped reward: the model gets a scaled-up boost for predicting the correct word, while incorrect words are penalized asymmetrically, with wrong words the model currently ranks as likely treated differently from those it ranks as unlikely. This reshapes the model's initial token distribution, creating a better starting point for reinforcement learning.
Why it matters?
This work is important because it shows that how you initially train a language model is crucial for its ability to learn and improve with reinforcement learning. Surprisingly, the research suggests that focusing on accuracy during initial training, rather than maximizing diversity, actually creates a better environment for the model to learn complex reasoning skills later on. This could lead to more effective training methods for future language models.
Abstract
Recent advancements have shown that reinforcement learning (RL) can substantially improve the reasoning abilities of large language models (LLMs). The effectiveness of such RL training, however, depends critically on the exploration space defined by the pre-trained model's token-output distribution. In this paper, we revisit the standard cross-entropy loss, interpreting it as a specific instance of policy gradient optimization applied within a single-step episode. To systematically study how the pre-trained distribution shapes the exploration potential for subsequent RL, we propose a generalized pre-training objective that adapts on-policy RL principles to supervised learning. By framing next-token prediction as a stochastic decision process, we introduce a reward-shaping strategy that explicitly balances diversity and precision. Our method employs a positive reward scaling factor to control probability concentration on ground-truth tokens and a rank-aware mechanism that treats high-ranking and low-ranking negative tokens asymmetrically. This allows us to reshape the pre-trained token-output distribution and investigate how to provide a more favorable exploration space for RL, ultimately enhancing end-to-end reasoning performance. Contrary to the intuition that higher distribution entropy facilitates effective exploration, we find that imposing a precision-oriented prior yields a superior exploration space for RL.
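The abstract's starting observation, that cross-entropy is a specific instance of policy gradient optimization on a single-step episode, can be checked numerically. The gradient of the cross-entropy loss with respect to the logits is `softmax(logits) - onehot(y)`, which coincides with the REINFORCE estimator `-r * d/dlogits log pi(a)` when the episode's action is the ground-truth token and the reward is 1. The sketch below verifies this identity; the function names are illustrative.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

def ce_grad(logits, y):
    """Analytic gradient of cross-entropy -log softmax(logits)[y] w.r.t. logits."""
    p = softmax(logits)
    onehot = np.eye(len(logits))[y]
    return p - onehot

def policy_grad_estimate(logits, action, reward=1.0):
    """Single-step REINFORCE estimator: -reward * grad of log pi(action).

    Since grad of log pi(action) w.r.t. logits is onehot(action) - softmax(logits),
    this equals reward * (softmax(logits) - onehot(action)).
    """
    p = softmax(logits)
    onehot = np.eye(len(logits))[action]
    return -reward * (onehot - p)

logits = np.array([1.5, 0.2, -0.3])
y = 0
# Cross-entropy is the special case where the "sampled" action is the
# ground-truth token and the reward is exactly 1.
assert np.allclose(ce_grad(logits, y), policy_grad_estimate(logits, y, reward=1.0))
```

This equivalence is what lets the paper generalize pre-training: once cross-entropy is read as a policy gradient with a fixed reward, the reward becomes a free design choice that can be reshaped to trade off diversity against precision.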