CLIMB: CLustering-based Iterative Data Mixture Bootstrapping for Language Model Pre-training
Shizhe Diao, Yu Yang, Yonggan Fu, Xin Dong, Dan Su, Markus Kliegl, Zijia Chen, Peter Belcak, Yoshi Suhara, Hongxu Yin, Mostofa Patwary, Yingyan Lin, Jan Kautz, Pavlo Molchanov
2025-04-18

Summary
This paper introduces CLIMB, an automated framework for selecting and mixing pre-training data so that language models learn more effectively from the huge amounts of unlabeled text collected from the web.
What's the problem?
The problem is that the giant text datasets used to train language models, such as Common Crawl, carry no labels indicating what topic or domain each piece of text comes from. Manually sorting and labeling all this data is extremely time-consuming, but simply mixing everything together at random is rarely the best way to help the model learn. Choosing the right mixture of data can substantially improve a model, yet finding that mixture by hand is hard.
What's the solution?
The researchers created CLIMB, which automatically groups similar pieces of text into clusters in a semantic embedding space and then searches over different combinations of those clusters to find the best training mixture. The search is repeated and refined over several iterations, using a small proxy model and a performance predictor to estimate which mixtures will work best before committing to training the full model. They also release two new datasets, ClimbLab and ClimbMix, so other researchers can study and reuse these mixtures.
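To make the clustering step concrete, here is a minimal sketch of how web documents could be embedded and grouped into semantic clusters. The specific encoder, cluster count, and any filtering or merging steps are assumptions for illustration; CLIMB's actual pipeline operates at much larger scale and its released ClimbLab corpus uses 20 clusters.

```python
# Hedged sketch: embed documents and cluster them in semantic space.
# The encoder name and cluster count below are illustrative assumptions,
# not CLIMB's actual configuration.
from sentence_transformers import SentenceTransformer
from sklearn.cluster import KMeans

documents = [
    "Photosynthesis converts light energy into chemical energy.",
    "The stock market rallied after the rate announcement.",
    "Quicksort partitions an array around a pivot element.",
    # ... in practice, millions of web documents
]

embedder = SentenceTransformer("all-MiniLM-L6-v2")  # hypothetical encoder choice
embeddings = embedder.encode(documents, normalize_embeddings=True)

n_clusters = 3  # ClimbLab itself is organized into 20 clusters
kmeans = KMeans(n_clusters=n_clusters, random_state=0, n_init="auto").fit(embeddings)

# Each document is assigned to a semantic cluster; these clusters become the
# units whose sampling weights are searched in later iterations.
for doc, label in zip(documents, kmeans.labels_):
    print(label, doc[:50])
```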
Why it matters?
This matters because it helps language models learn more efficiently and perform better, especially on specific topics. By automating the process of finding the best training data, CLIMB saves a lot of time and resources, and it pushes the quality of language models even higher.
Abstract
Pre-training datasets are typically collected from web content and lack inherent domain divisions. For instance, widely used datasets like Common Crawl do not include explicit domain labels, while manually curating labeled datasets such as The Pile is labor-intensive. Consequently, identifying an optimal pre-training data mixture remains a challenging problem, despite its significant benefits for pre-training performance. To address these challenges, we propose CLustering-based Iterative Data Mixture Bootstrapping (CLIMB), an automated framework that discovers, evaluates, and refines data mixtures in a pre-training setting. Specifically, CLIMB embeds and clusters large-scale datasets in a semantic space and then iteratively searches for optimal mixtures using a smaller proxy model and a predictor. When continuously trained on 400B tokens with this mixture, our 1B model exceeds the state-of-the-art Llama-3.2-1B by 2.0%. Moreover, we observe that optimizing for a specific domain (e.g., Social Sciences) yields a 5% improvement over random sampling. Finally, we introduce ClimbLab, a filtered 1.2-trillion-token corpus with 20 clusters as a research playground, and ClimbMix, a compact yet powerful 400-billion-token dataset designed for efficient pre-training that delivers superior performance under an equal token budget. We analyze the final data mixture, elucidating the characteristics of an optimal data mixture. Our data is available at: https://research.nvidia.com/labs/lpr/climb/
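The abstract describes an iterative loop in which candidate mixtures are evaluated with a smaller proxy model and a predictor guides the search toward better mixtures. The sketch below illustrates one plausible form of such a loop under stated assumptions: the proxy evaluation is a stand-in function, the candidate counts and the gradient-boosting predictor are illustrative choices, and CLIMB's actual proxy training, predictor, and search strategy may differ.

```python
# Hedged sketch of predictor-guided, iterative mixture search over cluster weights.
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n_clusters = 20      # matches the 20 clusters in ClimbLab
n_candidates = 32    # candidates evaluated per iteration (assumed value)
n_iterations = 3     # number of bootstrapping rounds (assumed value)

def proxy_score(weights: np.ndarray) -> float:
    """Placeholder for training a small proxy model on data sampled with
    `weights` and measuring its downstream performance."""
    return float(-np.sum((weights - 1.0 / n_clusters) ** 2) + rng.normal(0, 1e-3))

# Start from mixtures drawn uniformly at random on the simplex.
candidates = rng.dirichlet(np.ones(n_clusters), size=n_candidates)
history_w, history_s = [], []

for it in range(n_iterations):
    # Score each candidate mixture with the (cheap) proxy evaluation.
    scores = np.array([proxy_score(w) for w in candidates])
    history_w.append(candidates)
    history_s.append(scores)

    # Fit a predictor mapping mixture weights -> proxy performance.
    predictor = GradientBoostingRegressor().fit(
        np.vstack(history_w), np.concatenate(history_s)
    )

    # Propose a larger pool of mixtures and keep those the predictor ranks highest.
    pool = rng.dirichlet(np.ones(n_clusters), size=10 * n_candidates)
    candidates = pool[np.argsort(predictor.predict(pool))[-n_candidates:]]

best = np.vstack(history_w)[np.argmax(np.concatenate(history_s))]
print("best mixture weights:", np.round(best, 3))
```

In this kind of loop, the expensive step (proxy-model training) is only run on a small set of candidates per round, while the predictor screens a much larger pool cheaply; the best mixture found would then be used to sample data for full-scale pre-training.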