Scaling Synthetic Data Creation with 1,000,000,000 Personas
Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, Dong Yu
2024-07-01

Summary
This paper introduces Persona Hub, a collection of 1 billion diverse personas, together with a persona-driven method that uses them to generate synthetic data. Prompting a large language model from many different personas yields varied data for different applications, making it easier to create realistic and useful content.
What's the problem?
Creating high-quality synthetic data can be challenging because it often requires a wide variety of perspectives and scenarios. Traditional methods may not capture the diversity needed for effective applications, leading to less accurate or relevant data. This limitation can hinder advancements in areas like AI training and user interaction.
What's the solution?
To solve this problem, the authors built Persona Hub, a collection of 1 billion diverse personas automatically curated from web data. Each persona represents different characteristics, experiences, and knowledge. Prompting a large language model with these personas produces diverse datasets for various tasks, such as math problems, logical reasoning challenges, user instructions, knowledge-rich texts, game characters, and tools (functions). This approach allows for more flexibility and scalability in data creation.
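The core idea, pairing one task with many personas so that each pairing yields a different prompt, can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the template wording, the `SynthesisRequest` type, the `build_requests` helper, and the sample personas are all assumptions made for this example, and the step that sends each prompt to an LLM is left out.

```python
from dataclasses import dataclass

# Hypothetical template; the wording used in the paper may differ.
TEMPLATE = "Create {task} with the following persona:\n{persona}"


@dataclass(frozen=True)
class SynthesisRequest:
    """One persona paired with one task, plus the prompt built from them."""
    persona: str
    task: str
    prompt: str


def build_requests(personas, task):
    """Build one prompt per non-empty persona for the same task."""
    requests = []
    for persona in personas:
        text = persona.strip()
        if not text:
            continue  # skip blank persona descriptions
        prompt = TEMPLATE.format(task=task, persona=text)
        requests.append(SynthesisRequest(persona=text, task=task, prompt=prompt))
    return requests


# Made-up personas for illustration only; not taken from Persona Hub.
sample_personas = [
    "a pediatric nurse who works night shifts",
    "a high-school chemistry teacher",
    "a long-haul truck driver who tracks fuel costs",
]

requests = build_requests(sample_personas, "a math problem")
for request in requests:
    print(request.prompt)
    print()
```

Each prompt would then go to an LLM, and the same persona list can be reused with a different `task` string (for example a logical reasoning problem or a user instruction) without changing the code.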
Why does it matter?
This research is important because it significantly enhances the ability to produce high-quality synthetic data that reflects a wide range of human experiences and perspectives. By tapping into such a large pool of personas, developers can create more effective AI systems that better understand and interact with users. This could lead to improvements in education, gaming, customer service, and many other fields where personalized and context-aware content is valuable.
Abstract
We propose a novel persona-driven data synthesis methodology that leverages various perspectives within a large language model (LLM) to create diverse synthetic data. To fully exploit this methodology at scale, we introduce Persona Hub -- a collection of 1 billion diverse personas automatically curated from web data. These 1 billion personas (~13% of the world's total population), acting as distributed carriers of world knowledge, can tap into almost every perspective encapsulated within the LLM, thereby facilitating the creation of diverse synthetic data at scale for various scenarios. By showcasing Persona Hub's use cases in synthesizing high-quality mathematical and logical reasoning problems, instructions (i.e., user prompts), knowledge-rich texts, game NPCs and tools (functions) at scale, we demonstrate persona-driven data synthesis is versatile, scalable, flexible, and easy to use, potentially driving a paradigm shift in synthetic data creation and applications in practice, which may have a profound impact on LLM research and development.