Domain2Vec: Vectorizing Datasets to Find the Optimal Data Mixture without Training
Mozhi Zhang, Howe Tissue, Lu Wang, Xipeng Qiu
2025-06-15
Summary
This paper introduces Domain2Vec, a method that represents an entire dataset as a vector by decomposing it into a combination of smaller building blocks called meta-domains. These meta-domains capture the key characteristics of the data, making it possible to compare and combine large text collections mathematically, without training a model on each one from scratch.
What's the problem?
Pretraining language models requires enormous amounts of data, and performance depends heavily on how that data is mixed, for example, how much web text versus code versus books to include. Finding a good mixture is hard: existing methods typically search for it by running many expensive proxy training runs, which consumes time and compute and slows down progress.
What's the solution?
The authors created Domain2Vec, which uses a classifier to represent any dataset as a distribution over meta-domains, without training the language model first. Researchers can then find a good training mixture by comparing and combining these dataset vectors directly, saving substantial time and compute.
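To make the idea concrete, here is a minimal sketch in NumPy. It assumes two things not spelled out in the summary: that a classifier outputs a meta-domain probability distribution per document (so a dataset's vector is just the average over its documents), and that a good mixture can be found by searching for non-negative weights, summing to one, whose weighted combination of dataset vectors is closest to some target vector. The function names and the grid-search solver are illustrative, not the paper's actual algorithm.

```python
import itertools
import numpy as np

def dataset_vector(doc_probs):
    # Each row is one document's meta-domain distribution (classifier output);
    # the dataset's vector is the average distribution over its documents.
    return np.asarray(doc_probs).mean(axis=0)

def best_mixture(candidate_vecs, target_vec, step=0.05):
    # Coarse grid search over the probability simplex: find weights w >= 0
    # with sum(w) == 1 minimizing ||w @ V - target||_2, where row i of V is
    # the meta-domain vector of candidate dataset i.
    V = np.asarray(candidate_vecs)
    n = V.shape[0]
    ticks = int(round(1 / step))
    best_w, best_err = None, float("inf")
    # Enumerate the first n-1 weights in units of `step`; the last weight
    # is whatever remains so the weights always sum to 1.
    for combo in itertools.product(range(ticks + 1), repeat=n - 1):
        if sum(combo) > ticks:
            continue
        w = np.array(list(combo) + [ticks - sum(combo)]) / ticks
        err = np.linalg.norm(w @ V - np.asarray(target_vec))
        if err < best_err:
            best_w, best_err = w, err
    return best_w

# Toy example with 3 meta-domains and 3 candidate datasets (made-up numbers).
web   = dataset_vector([[0.7, 0.2, 0.1], [0.5, 0.4, 0.1]])
code  = dataset_vector([[0.1, 0.1, 0.8], [0.2, 0.0, 0.8]])
books = dataset_vector([[0.2, 0.7, 0.1], [0.3, 0.6, 0.1]])
weights = best_mixture([web, code, books], target_vec=[0.4, 0.3, 0.3])
```

The key point the sketch illustrates is that once every dataset is a fixed-length vector, mixture search becomes cheap vector arithmetic rather than repeated model training; the grid search here is only viable for a handful of candidates, and a real implementation would use a proper constrained optimizer.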
Why does it matter?
This matters because it helps make the training of language models much faster and more efficient, allowing better performance on tasks like text generation or understanding without requiring huge amounts of computing power. It makes AI development more accessible and sustainable.
Abstract
Domain2Vec decomposes datasets into meta-domains to optimize language model pretraining and downstream performance with reduced computational cost.