Concept-Aware Batch Sampling Improves Language-Image Pretraining
Adhiraj Ghosh, Vishaal Udandarao, Thao Nguyen, Matteo Farina, Mehdi Cherti, Jenia Jitsev, Sewoong Oh, Elisa Ricci, Ludwig Schmidt, Matthias Bethge
2025-11-26
Summary
This paper investigates how best to choose the data used to train vision-language models, AI systems that can understand both images and text. It argues that current methods for selecting training data fall short because they are fixed in advance and ignore the specific concepts present in the images and text.
What's the problem?
Existing methods for choosing training data for vision-language models are usually applied *after* the data is collected, as a one-time filtering step, so they can't adapt as the model learns. They also tend to rely on a pretrained model to filter data, which can introduce biases and limit the model's understanding to what it already 'knows'. Essentially, these methods aren't flexible enough to target specific learning goals or cover a wide range of concepts.
What's the solution?
The researchers created a large dataset called DataConcept containing 128 million image-text pairs annotated with fine-grained information about the concepts each pair contains. Building on it, they developed a technique called Concept-Aware Batch Sampling (CABS) which dynamically selects batches of data *during* training. CABS has two variants: Diversity Maximization (CABS-DM), which assembles batches covering a broad range of concepts, and Frequency Maximization (CABS-FM), which favors samples containing many objects at once. This allows the model to learn in a more targeted and efficient way.
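To make the diversity-maximizing idea concrete, here is a minimal sketch of greedy concept-coverage batch selection. This is not the paper's implementation: the function name `sample_batch_dm`, the candidate-pool format, and the per-sample concept sets are illustrative assumptions; the paper's actual scoring and data structures may differ.

```python
def sample_batch_dm(pool, batch_size):
    """Greedy diversity-maximizing batch sampling (illustrative sketch).

    `pool` is a list of (sample_id, concepts) pairs, where `concepts`
    is the set of concept labels annotated for that image-text pair.
    At each step, pick the candidate that contributes the most concepts
    not yet covered by the batch.
    """
    batch, covered = [], set()
    candidates = list(pool)
    for _ in range(min(batch_size, len(candidates))):
        # Score each remaining candidate by how many new concepts it adds.
        best = max(candidates, key=lambda s: len(s[1] - covered))
        candidates.remove(best)
        batch.append(best[0])
        covered |= best[1]
    return batch, covered

# Toy pool of annotated image-text pairs (hypothetical data).
pool = [
    ("img0", {"dog", "park"}),
    ("img1", {"dog", "leash"}),
    ("img2", {"cat", "sofa"}),
    ("img3", {"car", "street", "sign"}),
]
batch, covered = sample_batch_dm(pool, batch_size=2)
```

A frequency-maximizing variant would instead score each candidate by the number of concepts it contains on its own (favoring samples with high object multiplicity), leaving the greedy loop unchanged.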
Why it matters?
This work provides a powerful, publicly available alternative to the closed-source methods companies use to curate training data for their AI models. By allowing researchers and developers to customize the concepts the model learns about, CABS can lead to better performance on specific tasks and a more robust understanding of images and text. It's a step towards building more adaptable and less biased vision-language models.
Abstract
What data should a vision-language model be trained on? To answer this question, many data curation efforts center on the quality of a dataset. However, most of these existing methods are (i) offline, i.e. they produce a static dataset from a set of predetermined filtering criteria, and (ii) concept-agnostic, i.e. they use model-based filters which induce additional data biases. In this work, we go beyond such offline, concept-agnostic methods and advocate for more flexible, task-adaptive online concept-based curation. Our first contribution is DataConcept, a collection of 128M web-crawled image-text pairs annotated with fine-grained details about their concept composition. Building on DataConcept, we introduce Concept-Aware Batch Sampling (CABS), a simple yet effective batch sampling framework that flexibly constructs batches on-the-fly based on specific target distributions. We propose two variants: (i) Diversity Maximization (CABS-DM) to curate batches with a broad coverage of available concepts, and (ii) Frequency Maximization (CABS-FM) to curate batches with high object multiplicity. Through extensive evaluations across 28 benchmarks, we demonstrate that our CABS method significantly benefits CLIP/SigLIP model classes and yields highly performant models. Overall, CABS represents a strong open-source alternative to proprietary online data curation algorithms, enabling practitioners to define custom concept distributions that optimize for specific downstream tasks.
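The abstract's closing point, letting practitioners define a custom concept distribution to curate batches against, can be illustrated with a short sketch. This is a hypothetical interface, not the paper's API: the function name, the one-concept-per-sample pool format, and the `target` dictionary are assumptions made for clarity.

```python
import random

def sample_batch_to_target(pool, batch_size, target, rng=random.Random(0)):
    """Sketch: draw a batch whose concept histogram tracks a
    practitioner-defined target distribution (hypothetical API).

    `pool`: list of (sample_id, concept) pairs, one primary concept each.
    `target`: dict mapping concept -> desired fraction of the batch.
    """
    # Bucket the pool by concept label.
    by_concept = {}
    for sid, concept in pool:
        by_concept.setdefault(concept, []).append(sid)
    batch = []
    for concept, frac in target.items():
        want = round(frac * batch_size)
        avail = by_concept.get(concept, [])
        # Sample without replacement, capped by availability.
        batch.extend(rng.sample(avail, min(want, len(avail))))
    return batch

# Toy pool (hypothetical data): three dog samples, two cat samples.
pool = [("d0", "dog"), ("d1", "dog"), ("d2", "dog"),
        ("c0", "cat"), ("c1", "cat")]
batch = sample_batch_to_target(pool, batch_size=4,
                               target={"dog": 0.5, "cat": 0.5})
```

Swapping in a different `target` (e.g. upweighting rare concepts) is what would let a practitioner optimize for a specific downstream task.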