Train a Unified Multimodal Data Quality Classifier with Synthetic Data
Weizhi Wang, Rongmei Lin, Shiyang Li, Colin Lockard, Ritesh Sarkhel, Sanket Lokegaonkar, Jingbo Shang, Xifeng Yan, Nasser Zalmout, Xian Li
2025-10-20
Summary
This paper focuses on improving how well AI models that understand both images and text, called Multimodal Large Language Models (MLLMs), learn by carefully selecting the data they are trained on.
What's the problem?
Currently, MLLMs are trained on huge amounts of image and text data, but not much attention has been paid to *which* data is actually high quality. A lot of the available data might be noisy or not very helpful, hindering the model's ability to learn effectively. Getting enough labeled data to determine quality is also difficult and expensive.
What's the solution?
The researchers created a system called UniFilter: a lightweight MLLM trained to score the quality of both image-text pairs and image-text interleaved documents. To train UniFilter without expensive human labeling, they generated their own training data, pairing existing raw images with automatically generated text at four distinct quality levels. They then used UniFilter to clean up two large datasets, DataComp and OBELICS, removing lower-quality examples.
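The filtering step can be pictured as scoring each sample with the trained classifier and keeping only those above a threshold. The following is a minimal illustrative sketch, not the paper's actual implementation; `score_fn` and the threshold value are hypothetical stand-ins for the trained UniFilter model and its tuned cutoff.

```python
# Illustrative sketch of score-based data filtering.
# `score_fn` stands in for a trained quality classifier (hypothetical);
# the threshold 0.5 is arbitrary, not a value from the paper.
def filter_by_quality(samples, score_fn, threshold=0.5):
    """Keep only samples whose predicted quality score clears the threshold."""
    return [s for s in samples if score_fn(s) >= threshold]

# Toy usage: precomputed scores stand in for classifier outputs.
samples = [
    {"caption": "a dog running on grass", "score": 0.9},
    {"caption": "asdf click here http://spam", "score": 0.1},
]
kept = filter_by_quality(samples, score_fn=lambda s: s["score"])
# kept retains only the high-scoring sample
```

In practice such a threshold is chosen to trade off dataset size against average quality; the paper applies this idea to both caption and interleaved-document data.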
Why does it matter?
Training MLLMs on this filtered, higher-quality data resulted in significantly better performance. The models were better at zero-shot reasoning about images and text, and also learned more effectively from examples provided at inference time (in-context learning). The researchers are also sharing their tools and the cleaned-up dataset with the AI community, which will help others build even better multimodal AI systems.
Abstract
Multimodal Large Language Models (MLLMs) are continually pre-trained on a mixture of image-text caption data and interleaved document data, yet high-quality data filtering for image-text interleaved document data remains under-explored. We propose to train an efficient MLLM as a Unified Multimodal Data Quality Classifier to Filter both high-quality image-text caption and interleaved data (UniFilter). To address the challenge of collecting diverse labeled multimodal data, we introduce a semi-synthetic approach that leverages readily available raw images and generates corresponding text across four quality levels. This method enables efficient creation of sample-score pairs for both caption and interleaved document data to train UniFilter. We apply UniFilter to curate high-quality caption data from the DataComp caption dataset and interleaved data from the OBELICS image-text interleaved dataset. MLLMs pre-trained on the filtered data demonstrate significantly enhanced capabilities compared to those trained on baseline-filtered data, achieving stronger zero-shot reasoning and in-context learning capabilities. After visual supervised fine-tuning, these UniFilter-induced MLLMs achieve stronger performance on various benchmarks, highlighting the downstream benefits of high-quality multimodal pre-training. We release the synthetic training data used for training UniFilter, the UniFilter model checkpoints, and the high-quality interleaved document subset OBELICS-HQ, curated by UniFilter, to the community for reproduction and further development.
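The semi-synthetic data construction described in the abstract can be sketched as follows: for each raw image, text is generated at each of four quality levels, yielding (sample, score) pairs for classifier training. This is a hedged illustration only; the level names and the `generate_text` function are hypothetical placeholders, not the paper's actual generation pipeline.

```python
# Illustrative sketch of building sample-score training pairs across
# four quality levels, as described in the abstract. Level names are
# invented for illustration; `generate_text` is a hypothetical
# text generator conditioned on an image and a target quality level.
QUALITY_LEVELS = {0: "irrelevant", 1: "weakly related",
                  2: "relevant", 3: "high quality"}

def make_sample_score_pairs(images, generate_text):
    """For each raw image, synthesize text at every quality level,
    producing (image, text, score) triples for classifier training."""
    pairs = []
    for img in images:
        for score in QUALITY_LEVELS:
            pairs.append((img, generate_text(img, score), score))
    return pairs
```

A classifier trained on such triples learns to map an arbitrary image-text sample to a quality score, which is what enables the downstream filtering of DataComp and OBELICS.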