Jointly Reinforcing Diversity and Quality in Language Model Generations

Tianjian Li, Yiming Zhang, Ping Yu, Swarnadeep Saha, Daniel Khashabi, Jason Weston, Jack Lanchantin, Tianlu Wang

2025-09-03

Summary

This paper focuses on a problem with how we improve large language models (LLMs) – while making them more accurate and helpful, we often unintentionally make their responses less diverse and creative.

What's the problem?

When we post-train LLMs after their initial development to be better at tasks, we usually focus on getting the *right* answer and making sure the responses are useful. However, this process tends to make the models stick to very predictable and similar responses. This is a problem because a model that always gives the same type of answer isn't very useful for tasks that require brainstorming, writing stories, or finding multiple solutions to a problem; it limits the model's ability to explore different ideas.

What's the solution?

The researchers developed a new method called Diversity-Aware Reinforcement Learning, or DARLING. DARLING doesn't just reward the model for being correct; it *also* rewards it for generating responses that are genuinely different from one another. To do this, it uses a learned partition function that measures how different responses are in terms of their meaning, not just the words used. This 'diversity signal' is then combined with the usual 'quality' signal during online reinforcement learning, pushing the model to be both good *and* original.
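To make that concrete, here is a minimal sketch of how a diversity bonus could be combined with a quality reward for a group of sampled responses. This is not the paper's implementation: `quality_reward` and `same_meaning` are hypothetical stand-ins for a task reward and the learned semantic partition function, and the multiplicative combination is an assumption for illustration.

```python
from typing import Callable, List


def diversity_aware_rewards(
    responses: List[str],
    quality_reward: Callable[[str], float],
    same_meaning: Callable[[str, str], bool],
) -> List[float]:
    """Assign each sampled response a reward that mixes quality with a
    semantic-diversity bonus (responses whose meaning is rare in the
    batch get a larger bonus)."""
    n = len(responses)

    # Group responses into semantic-equivalence classes.
    cluster_id = [-1] * n
    representatives: List[int] = []  # index of the first response in each class
    for i, response in enumerate(responses):
        for rep in representatives:
            if same_meaning(response, responses[rep]):
                cluster_id[i] = cluster_id[rep]
                break
        else:
            cluster_id[i] = len(representatives)
            representatives.append(i)

    # A response's diversity bonus shrinks as its equivalence class grows.
    class_sizes = [cluster_id.count(c) for c in range(len(representatives))]
    rewards = []
    for i, response in enumerate(responses):
        diversity_bonus = 1.0 / class_sizes[cluster_id[i]]  # in (0, 1]
        # Assumed combination: multiply, so only responses that are both
        # high-quality and distinct receive a high reward.
        rewards.append(quality_reward(response) * diversity_bonus)
    return rewards
```

In an online RL setup that samples a group of responses per prompt, these combined rewards would take the place of the quality-only reward when computing advantages, which is what nudges the model toward answers that are both good and distinct.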

Why it matters?

This work is important because it shows that we can improve LLMs to be both high-quality *and* diverse. The experiments showed that DARLING works well on different kinds of tasks, including creative writing and even solving competition math problems. Surprisingly, encouraging diversity actually helped the model give better answers overall, because it pushed the model to explore a wider range of possibilities during training.

Abstract

Post-training of Large Language Models (LMs) often prioritizes accuracy and helpfulness at the expense of diversity. This creates a tension: while post-training improves response quality, it also sharpens output distributions and reduces the range of ideas, limiting the usefulness of LMs in creative and exploratory tasks such as brainstorming, storytelling, or problem solving. We address this challenge with Diversity-Aware Reinforcement Learning (DARLING), a framework that jointly optimizes for response quality and semantic diversity. At its core, DARLING introduces a learned partition function to measure diversity beyond surface-level lexical variations. This diversity signal is then combined with a quality reward during online reinforcement learning, encouraging models to generate outputs that are both high-quality and distinct. Experiments across multiple model families and sizes show that DARLING generalizes to two regimes: non-verifiable tasks (instruction following and creative writing) and verifiable tasks (competition math). On five benchmarks in the first setting, DARLING consistently outperforms quality-only RL baselines, producing outputs that are simultaneously of higher quality and novelty. In the second setting, DARLING achieves higher pass@1 (solution quality) and pass@k (solution variety). Most strikingly, explicitly optimizing for diversity catalyzes exploration in online RL, which manifests itself as higher-quality responses.
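For readers unfamiliar with the pass@k metric mentioned above: it estimates the chance that at least one of k sampled solutions is correct. A small sketch of the standard unbiased estimator (a common convention, not something specific to this paper), computed from n samples of which c are correct:

```python
import math


def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator: probability that at least one of
    k solutions drawn (without replacement) from n samples, c of which are
    correct, solves the problem."""
    if n - c < k:
        return 1.0  # every size-k draw must contain a correct sample
    return 1.0 - math.comb(n - c, k) / math.comb(n, k)


# Illustrative numbers only: 16 samples per problem, 6 of them correct.
print(pass_at_k(16, 6, 1))  # ≈ 0.375 (solution quality)
print(pass_at_k(16, 6, 8))  # ≈ 0.997 (solution variety / coverage)
```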