Where does output diversity collapse in post-training?
Constantinos Karouzos, Xingwei Tan, Nikolaos Aletras
2026-04-20
Summary
This paper investigates why language models, after being further trained (post-trained) to be more helpful or follow instructions, often start giving very similar answers instead of a variety of creative responses.
What's the problem?
When you take a powerful language model and then train it further to be better at specific tasks, like answering questions in a detailed way or following instructions, it tends to lose its ability to generate diverse outputs. This is a problem because many applications, especially those needing creativity or dealing with subjective topics, rely on getting a range of different ideas. The researchers wanted to figure out *why* this happens: is it the way the models are trained, the data they're trained on, or how we ask them to generate text?
What's the solution?
The researchers traced output diversity through three parallel post-training lineages of the Olmo 3 language model: Think (trained via chain-of-thought distillation), Instruct (trained on broad multi-source data), and RL-Zero. They compared how each lineage affected the variety of the model's responses across 15 different tasks, using four different diversity metrics. They found that the composition of the training data was a major factor: the Think lineage lost most of its semantic diversity early, during supervised fine-tuning, while the Instruct lineage was affected more strongly by a preference-tuning stage (DPO). They also showed that the problem isn't just about *how* you ask the model a question; suppressing chain-of-thought reasoning at inference hurt accuracy on hard tasks but left answer-level diversity unchanged, meaning the lack of diversity is built into the model's weights during training. Finally, they broke diversity loss down into removing incorrect answers versus narrowing the range of correct answers, and found that this split differed by task, with Think models keeping more correct-answer diversity than Instruct.
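The decomposition idea above can be made concrete with a small sketch. This is an illustrative toy, not the paper's exact formulation: it uses Shannon entropy over distinct answers as a stand-in diversity metric, and all names and data here are hypothetical. The drop in diversity from base to post-trained model is split into a quality-control part (diversity the base model loses if its incorrect answers are simply removed) and a residual part (further narrowing among the answers that remain correct).

```python
# Hedged sketch of the quality-control vs. residual decomposition of
# diversity loss. The entropy-based metric and the toy data below are
# illustrative assumptions, not the paper's actual metrics or results.
from collections import Counter
import math

def entropy_diversity(answers):
    """Shannon entropy (bits) over the distribution of distinct answers."""
    counts = Counter(answers)
    total = len(answers)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def decompose_diversity_loss(base_samples, post_samples, correct):
    """Split the base -> post-trained diversity drop into two parts.

    correct: set of answers treated as correct for this verifiable task.
    Returns (quality_control, residual), which sum to the total loss.
    """
    d_base = entropy_diversity(base_samples)
    d_post = entropy_diversity(post_samples)
    # Diversity the base model would retain if only correct answers were kept:
    d_base_correct = entropy_diversity([a for a in base_samples if a in correct])
    quality_control = d_base - d_base_correct  # loss from dropping wrong answers
    residual = d_base_correct - d_post         # genuine narrowing among correct ones
    return quality_control, residual

# Toy example: the base model spreads over four answers (two of them wrong);
# the post-trained model concentrates on a single correct answer.
base = ["4", "4", "four", "5", "3"]
post = ["4", "4", "4", "4", "4"]
qc, res = decompose_diversity_loss(base, post, correct={"4", "four"})
```

On this toy data both components are positive: some loss comes from discarding the wrong answers "5" and "3", and the rest from the post-trained model collapsing onto a single phrasing of the correct answer, mirroring the paper's observation that the split between the two components is task-dependent.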
Why it matters?
This research is important because it shows that simply trying to tweak how we *use* a language model won’t fix the problem of limited diversity after post-training. The key is to focus on the training data itself and how it’s used. This means developers need to be more careful about the data they use to fine-tune models if they want them to remain creative and avoid producing overly similar outputs, especially for tasks where varied perspectives are important.
Abstract
Post-trained language models produce less varied outputs than their base counterparts. This output diversity collapse undermines inference-time scaling methods that rely on varied samples, and risks homogenizing model outputs on creative and value-laden tasks. Prior work attributes collapse to specific post-training methods, without separating the role of training data composition from the method, or the generation format from the model weights. We trace output diversity through three parallel post-training lineages of Olmo 3: Think (chain-of-thought distillation), Instruct (broad multi-source data), and RL-Zero, evaluated across 15 tasks and four text diversity metrics. We find that the location of collapse co-varies with data composition: the Think lineage loses most semantic diversity at supervised fine-tuning, and the effect of DPO is larger in Instruct than in Think. Suppressing chain-of-thought reasoning at inference in Think models drops accuracy on hard tasks, yet leaves answer-level diversity unchanged, showing that the collapse is embedded in the model weights by training data, not imposed by the generation format. Decomposing diversity loss on six verifiable tasks into a quality-control component (removal of incorrect outputs) and a residual component (genuine narrowing among correct outputs) reveals that the split is task-dependent, and Think models retain more correct-answer diversity than Instruct despite collapsing more in aggregate. Our results indicate that diversity collapse is determined during training by data composition and cannot be addressed at inference time alone.