VOYAGER: A Training Free Approach for Generating Diverse Datasets using LLMs
Avinash Amballa, Yashas Malur Saidutta, Chi-Heng Lin, Vivek Kulkarni, Srinivas Chappidi
2025-12-18
Summary
This paper introduces Voyager, a method for building synthetic datasets with large language models (LLMs). Such datasets are used to train and evaluate other AI models, but existing generation methods tend to produce data with little variety.
What's the problem?
When an LLM generates data to train or test other AI models, the output tends to be repetitive and covers a narrow range of examples. This limited diversity can keep a model trained on the data from learning effectively, and it can make an evaluation set a poor reflection of real-world scenarios. In short, a model can only be as varied as the data it practices on.
What's the solution?
Voyager tackles this problem iteratively: it generates data in rounds and selects examples that differ from one another. It uses a mathematical framework called determinantal point processes to directly optimize a measure of the dataset's diversity. The method requires no additional training of the language model, works with closed-source models whose internals are not publicly available, and scales to large dataset creation.
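To make the selection idea concrete, here is a minimal sketch of determinant-based diverse selection. It is not the authors' algorithm: it only illustrates the general principle behind determinantal point processes, namely that a set of items whose similarity matrix has a large determinant is a diverse set. The embeddings, the RBF similarity kernel, the `gamma` parameter, and the function names are all assumptions made for illustration. The sketch also selects from a fixed candidate pool, whereas Voyager generates new candidates iteratively.

```python
import math
from typing import List, Sequence


def rbf_kernel(x: Sequence[float], y: Sequence[float], gamma: float = 1.0) -> float:
    """Similarity in (0, 1]; equals 1.0 for identical points (assumed kernel)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def determinant(matrix: List[List[float]]) -> float:
    """Determinant via Gaussian elimination with partial pivoting."""
    a = [row[:] for row in matrix]
    n = len(a)
    det = 1.0
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(a[r][col]))
        if abs(a[pivot][col]) < 1e-12:
            return 0.0  # numerically singular: the items are redundant
        if pivot != col:
            a[col], a[pivot] = a[pivot], a[col]
            det = -det
        det *= a[col][col]
        for r in range(col + 1, n):
            factor = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= factor * a[col][c]
    return det


def greedy_diverse_subset(
    embeddings: Sequence[Sequence[float]], k: int, gamma: float = 1.0
) -> List[int]:
    """Greedily add the item that maximizes the kernel determinant of the set."""
    selected: List[int] = []
    while len(selected) < min(k, len(embeddings)):
        best_idx, best_det = -1, -math.inf
        for i in range(len(embeddings)):
            if i in selected:
                continue
            cand = selected + [i]
            sub = [
                [rbf_kernel(embeddings[p], embeddings[q], gamma) for q in cand]
                for p in cand
            ]
            d = determinant(sub)
            if d > best_det:  # ties keep the earliest index
                best_idx, best_det = i, d
        selected.append(best_idx)
    return selected


# Hypothetical 2-D "embeddings": items 1 and 4 are near-duplicates of item 0.
pool = [[0.0, 0.0], [0.05, 0.0], [1.0, 0.0], [0.0, 1.0], [0.02, 0.03]]
print(greedy_diverse_subset(pool, k=3))  # [0, 2, 3]
```

The near-duplicates are skipped because adding one makes two rows of the similarity matrix almost identical, which drives the determinant toward zero. Recomputing the determinant for every candidate is wasteful at scale; it is used here only to keep the sketch short.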
Why does it matter?
This research matters because it offers a way to build more diverse datasets for training and evaluating AI. Greater dataset diversity helps make AI models more robust, more reliable, and able to handle a wider variety of situations. In the paper's experiments, Voyager produces datasets that are 1.5 to 3 times more diverse than those from popular baseline approaches.
Abstract
Large language models (LLMs) are increasingly being used to generate synthetic datasets for the evaluation and training of downstream models. However, prior work has noted that such generated data lacks diversity. In this paper, we propose Voyager, a novel principled approach to generate diverse datasets. Our approach is iterative and directly optimizes a mathematical quantity that captures the diversity of the dataset using the machinery of determinantal point processes. Furthermore, our approach is training-free, applicable to closed-source models, and scalable. In addition to providing theoretical justification for the working of our method, we also demonstrate through comprehensive experiments that Voyager significantly outperforms popular baseline approaches by providing a 1.5-3x improvement in diversity.
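For readers unfamiliar with determinantal point processes, the standard L-ensemble definition is given below for reference. It is textbook background and not necessarily the exact objective the paper optimizes, which this abstract does not state. The symbols $L$, $S$, and $\phi_i$ are notation introduced here.

```latex
% L is a positive semidefinite similarity kernel over N candidate items,
% and L_S is the submatrix of L indexed by the subset S.
\[
  P(S) \;\propto\; \det(L_S), \qquad L_S = [L_{ij}]_{i,j \in S}, \quad S \subseteq \{1, \dots, N\}.
\]
% If L is a Gram matrix, L_{ij} = \phi_i^\top \phi_j, then \det(L_S) is the
% squared volume of the parallelepiped spanned by \{\phi_i : i \in S\}.
```

Because near-parallel feature vectors span little volume, subsets containing similar items receive low probability, which is why the determinant works as a diversity measure.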