
SampleMix: A Sample-wise Pre-training Data Mixing Strategy by Coordinating Data Quality and Diversity

Xiangyu Xi, Deyang Kong, Jian Yang, Jiawei Yang, Zhengyu Chen, Wei Wang, Jingang Wang, Xunliang Cai, Shikun Zhang, Wei Ye

2025-03-04

Summary

This paper introduces SampleMix, a new way to choose and mix training data for large language models (LLMs) that evaluates each piece of data individually instead of grouping data by topic or domain.

What's the problem?

Current methods for mixing LLM training data first assign a weight to each topic (domain) and then sample uniformly within each one. This top-down approach misses overlaps and connections between topics and ignores how good or diverse each individual piece of data is. As a result, the training set may be worse than it could be, which limits how well the model learns.

What's the solution?

The researchers created SampleMix, which evaluates each piece of data separately instead of grouping by topic. It scores every sample's quality and diversity, then uses those scores to sample globally across all topics at once. This 'bottom-up' approach captures connections between topics and ensures the model learns from the best and most varied data available.
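The idea of scoring each sample and then sampling globally can be sketched in a few lines. This is a minimal illustration under assumed details, not the paper's actual method: the scoring functions, the weighting scheme (a simple convex combination with a hypothetical `alpha` parameter), and the example scores are all placeholders.

```python
import random

def sampling_weights(quality, diversity, alpha=0.5):
    """Combine per-sample quality and diversity scores into sampling
    weights (alpha is a hypothetical trade-off knob, not from the paper)."""
    scores = [alpha * q + (1 - alpha) * d for q, d in zip(quality, diversity)]
    total = sum(scores)
    return [s / total for s in scores]

def sample_dataset(docs, quality, diversity, k, seed=0):
    """Draw a training subset of size k across ALL domains at once,
    weighted by each sample's combined score (the 'bottom-up' step)."""
    rng = random.Random(seed)
    weights = sampling_weights(quality, diversity)
    return rng.choices(docs, weights=weights, k=k)

# Toy data: scores would come from e.g. a quality classifier and a
# diversity/rarity estimate in a real pipeline (assumed here).
docs = ["doc_a", "doc_b", "doc_c"]
quality = [0.9, 0.4, 0.7]
diversity = [0.2, 0.8, 0.6]
subset = sample_dataset(docs, quality, diversity, k=5)
```

Because sampling is global rather than per-domain, a high-quality, highly diverse sample is preferred regardless of which topic it came from, which is the key contrast with domain-wise mixing.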

Why it matters?

This matters because SampleMix helps AI models learn better and faster. In tests, it outperformed other methods, and baseline approaches needed 1.4 to 2.1 times as many training steps to match its performance. This could make it easier and cheaper to create powerful AI language models, potentially leading to better AI assistants, more accurate translation tools, and smarter chatbots that understand and communicate more effectively.

Abstract

Existing pretraining data mixing methods for large language models (LLMs) typically follow a domain-wise methodology, a top-down process that first determines domain weights and then performs uniform data sampling across each domain. However, these approaches neglect significant inter-domain overlaps and commonalities, failing to control the global diversity of the constructed training dataset. Further, uniform sampling within domains ignores fine-grained sample-specific features, potentially leading to suboptimal data distribution. To address these shortcomings, we propose SampleMix, a novel sample-wise data mixture approach based on a bottom-up paradigm. This method performs global cross-domain sampling by systematically evaluating the quality and diversity of each sample, thereby dynamically determining the optimal domain distribution. Comprehensive experiments across multiple downstream tasks and perplexity assessments demonstrate that SampleMix surpasses existing domain-based methods. Meanwhile, baselines require 1.4x to 2.1x more training steps to achieve SampleMix's performance, highlighting the substantial potential of SampleMix to optimize pre-training data.