SimpleStrat: Diversifying Language Model Generation with Stratification
Justin Wong, Yury Orlovskiy, Michael Luo, Sanjit A. Seshia, Joseph E. Gonzalez
2024-10-14

Summary
This paper presents SimpleStrat, a new method that helps large language models (LLMs) generate more diverse and interesting responses by organizing their output into different categories, or 'strata.'
What's the problem?
Generating varied responses from LLMs is important for applications like creative writing and data generation. However, the standard way to increase diversity, raising the sampling temperature during generation, lowers the quality of individual responses; keeping the temperature low, on the other hand, yields outputs that are often too similar to one another and not as engaging.
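For context, temperature works by rescaling the model's logits before sampling. Here is a minimal sketch of this standard mechanism (not code from the paper):

```python
import numpy as np

def sample_with_temperature(logits, temperature=1.0, rng=None):
    """Sample a token index from logits softened by a temperature.

    Higher temperature flattens the distribution (more diverse but
    noisier picks); lower temperature sharpens it (more repetitive).
    """
    rng = rng or np.random.default_rng()
    scaled = np.asarray(logits, dtype=np.float64) / temperature
    scaled -= scaled.max()  # subtract max for numerical stability
    probs = np.exp(scaled) / np.exp(scaled).sum()
    return rng.choice(len(probs), p=probs)
```

The trade-off is visible directly in the formula: as temperature grows, the distribution approaches uniform over all tokens, including low-quality ones, which is why diversity gained this way comes at a quality cost.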
What's the solution?
SimpleStrat addresses this issue by using the LLM itself to create different strata based on characteristics like topic or sentiment. When generating a response, one of these strata is selected at random and the model produces an answer from that category. This stratified approach allows for a wider range of outputs while maintaining quality. The researchers also introduced a new dataset called CoverageQA to measure how diverse the generated responses are.
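As a rough illustration, here is a minimal sketch of this stratified sampling loop. It assumes a generic `llm(prompt)` completion call; the function name and prompt wording are placeholders, not the paper's exact pipeline:

```python
import json
import random

def llm(prompt: str) -> str:
    """Placeholder for a call to any chat/completion API."""
    raise NotImplementedError

def simplestrat_sample(question: str, rng=random) -> str:
    # Step 1: ask the model itself to partition the space of
    # plausible answers into strata (e.g., by topic or sentiment).
    strata_prompt = (
        "List 5 distinct categories that partition the plausible "
        "answers to the question below. Respond as a JSON list of "
        f"strings.\n\nQuestion: {question}"
    )
    strata = json.loads(llm(strata_prompt))

    # Step 2: pick a stratum uniformly at random, then sample a
    # response constrained to that stratum.
    stratum = rng.choice(strata)
    answer_prompt = (
        f"{question}\nGive one answer that falls in this "
        f"category: {stratum}"
    )
    return llm(answer_prompt)
```

The key design point is that randomness is injected at the stratum level rather than by flattening the token distribution, so each individual generation can still be sampled at a normal, quality-preserving temperature.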
Why it matters?
This research is significant because it enhances the ability of language models to produce varied and high-quality text. By improving diversity in generated outputs, SimpleStrat can make applications like chatbots, storytelling, and content creation more engaging and reflective of real-world complexity.
Abstract
Generating diverse responses from large language models (LLMs) is crucial for applications such as planning/search and synthetic data generation, where diversity provides distinct answers across generations. Prior approaches rely on increasing temperature to increase diversity. However, contrary to popular belief, we show that not only does this approach produce lower-quality individual generations as temperature increases, but it also depends on the model's next-token probabilities being similar to the true distribution of answers. We propose SimpleStrat, an alternative approach that uses the language model itself to partition the space into strata. At inference, a random stratum is selected and a sample is drawn from within that stratum. To measure diversity, we introduce CoverageQA, a dataset of underspecified questions with multiple equally plausible answers, and assess diversity by measuring KL Divergence between the output distribution and a uniform distribution over valid ground-truth answers. As computing the probability of each response/solution for proprietary models is infeasible, we measure recall on ground-truth solutions. Our evaluation shows that SimpleStrat achieves 0.05 higher recall than GPT-4o and an average 0.36 reduction in KL Divergence compared to Llama 3.
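To make the diversity metric concrete, here is one way such a coverage measurement could be computed, assuming generations have already been canonicalized to the ground-truth answer set. The helper and its treatment of invalid answers are assumptions for illustration, not the paper's evaluation code:

```python
import math
from collections import Counter

def kl_to_uniform(samples, valid_answers):
    """Empirical KL(p_model || uniform) over valid ground-truth answers.

    `samples` are model generations already mapped to canonical
    answers; 0.0 means perfectly uniform coverage, larger values
    mean the model's outputs are skewed toward a few answers.
    """
    n = len(valid_answers)
    counts = Counter(s for s in samples if s in valid_answers)
    total = sum(counts.values())
    kl = 0.0
    for answer in valid_answers:
        p = counts[answer] / total if total else 0.0
        if p > 0:  # zero-probability terms contribute 0 to KL
            kl += p * math.log(p / (1.0 / n))
    return kl
```

For example, if a question has four equally valid answers and 100 samples all land on one of them, the KL divergence is log(4) ≈ 1.39, whereas 25 samples per answer gives 0.0.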