WILDCHAT-50M: A Deep Dive Into the Role of Synthetic Data in Post-Training
Benjamin Feuer, Chinmay Hegde
2025-01-31

Summary
This paper introduces WILDCHAT-50M, a massive new dataset designed to help improve large language models (LLMs) after their initial training. It's like a huge collection of practice conversations for AI to learn from, generated by many different AI models.
What's the problem?
The problem is that while we have ways to make AI language models better after they're first trained, we don't have enough good data to test and compare these methods. It's like trying to improve a student's skills but not having enough varied practice tests to see what works best. This makes it hard for researchers to figure out the best ways to teach AI new skills or behaviors.
What's the solution?
The researchers created WILDCHAT-50M, the largest public dataset of AI chat conversations to date. They took an existing dataset called WildChat and made it much bigger by regenerating responses with over 50 different AI models, ranging in size from 0.5 billion to 104 billion parameters. They then used this huge dataset to build RE-WILD, a new supervised fine-tuning (SFT) data mixture. When they tested RE-WILD, it outperformed a previous mixture (Allen AI's Tulu-3) while using only 40% as many training samples.
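The multi-model regeneration step described above can be sketched in a few lines. This is an illustrative sketch, not the authors' code: the function name `regenerate_responses` and the record format are assumptions, and each "model" is stood in for by a plain Python callable, where the real pipeline would wrap an open-weight LLM behind an inference engine.

```python
# Illustrative sketch (assumed names, not the paper's implementation):
# regenerate responses to existing chat prompts with several different
# models, producing one record per (model, prompt) pair.

from typing import Callable, Dict, List


def regenerate_responses(
    prompts: List[str],
    models: Dict[str, Callable[[str], str]],
) -> List[dict]:
    """Collect one synthetic-data record per (model, prompt) pair.

    `models` maps a model name to a generation callable; in practice
    this would call an actual LLM rather than a stub.
    """
    records = []
    for model_name, generate in models.items():
        for prompt in prompts:
            records.append({
                "model": model_name,
                "prompt": prompt,
                "response": generate(prompt),
            })
    return records


if __name__ == "__main__":
    # Stand-in generators for demonstration only.
    models = {
        "tiny-0.5b": lambda p: f"[tiny] answer to: {p}",
        "large-104b": lambda p: f"[large] answer to: {p}",
    }
    data = regenerate_responses(["What is SFT?"], models)
    print(len(data))  # one record per (prompt, model) pair
```

Keeping the model name on every record is what makes the downstream comparative analysis possible: the same prompts can be grouped by generating model and compared directly.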
Why it matters?
This matters because it gives researchers a powerful new tool to improve AI language models. With WILDCHAT-50M, they can now test different training methods more thoroughly and compare them fairly. This could lead to AI that's better at understanding and communicating with humans in a wider range of situations. By making their dataset and code public, the researchers are also helping the whole AI community work together to advance the field faster. In the long run, this could mean smarter, more helpful AI assistants for everyone.
Abstract
Language model (LLM) post-training, from DPO to distillation, can refine behaviors and unlock new skills, but the open science supporting these post-training techniques is still in its infancy. One limiting factor has been the difficulty of conducting large-scale comparative analyses of synthetic data-generating models and LLM judges. To close this gap, we introduce WILDCHAT-50M, the largest public chat dataset to date. We extend the existing WildChat dataset to include responses not only from GPT, but from over 50 different open-weight models, ranging in size from 0.5B to 104B parameters. We conduct an extensive comparative analysis and demonstrate the potential of this dataset by creating RE-WILD, our own public SFT mix, which outperforms the recent Tulu-3 SFT mixture from Allen AI with only 40% as many samples. Our dataset, samples, and code are available at https://github.com/penfever/wildchat-50m.