
Evaluating Sample Utility for Data Selection by Mimicking Model Weights

Tzu-Heng Huang, Manjot Bilkhu, Frederic Sala, Javier Movellan

2025-01-14

Summary

This paper presents a new way to pick out the best data for training AI models. The researchers created a tool called the Mimic Score that helps figure out which pieces of information are most useful for teaching an AI, kind of like choosing the best ingredients for a recipe.

What's the problem?

When training big AI models, people often use huge amounts of information from the internet. But this information can include a lot of stuff that's not helpful or even harmful, like incorrect data or biased information. It's hard to figure out which parts of all this data are actually good for teaching the AI.

What's the solution?

The researchers came up with the Mimic Score, which acts like a smart assistant for picking out good data. It uses a well-trained reference model as a guide: a training example gets a high score if learning from it would nudge the new model's weights toward that reference model, and a low score if it pulls the model away. They also built a system called Grad-Mimic that uses Mimic Scores to automatically choose the best data for training. When they tested the method on different types of image data, it helped AI models learn better and faster.
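To make the selection step concrete, here is a tiny, hypothetical example: once each training sample has a score, a filter simply keeps the samples whose scores are above some cutoff. The numbers and the median cutoff below are made up for illustration and are not the paper's settings.

import numpy as np

scores = np.array([0.8, -0.2, 0.5, 0.1, -0.6])  # one score per training sample
cutoff = np.median(scores)                      # example threshold, not the paper's choice
keep = scores > cutoff                          # mask of samples judged useful
print(np.flatnonzero(keep))                     # -> [0 2]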

Why it matters?

This matters because it could make AI training more efficient and effective. By using only the best data, we can create smarter AI models that work better and are less likely to have problems like biases or mistakes. This could lead to better AI in all sorts of applications, from image recognition to language understanding, while using less computer power and time. It's like finding a way to teach AI more effectively, which could speed up progress in artificial intelligence research and applications.

Abstract

Foundation models rely on large-scale web-crawled datasets, which frequently contain noisy data, biases, and irrelevant content. Existing data selection techniques typically use human heuristics, downstream evaluation datasets, or specialized scoring models, and can overlook samples' utility in the training process. Instead, we propose a new approach, Mimic Score, a data quality metric that uses a pretrained reference model as a guide to assess the usefulness of data samples for training a new model. It relies on the alignment between the gradient of the new model parameters and the vector pointing toward the reference model in weight space. Samples that misalign with this direction are considered low-value and can be filtered out. Motivated by the Mimic score, we develop Grad-Mimic, a data selection framework that identifies and prioritizes useful samples, automating the selection process to create effective filters. Empirically, using Mimic scores to guide model training results in consistent performance gains across six image datasets and enhances the performance of CLIP models. Moreover, Mimic scores and their associated filters improve upon existing filtering methods and offer accurate estimation of dataset quality.
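For readers who want a concrete picture of the idea in the abstract, here is a minimal, hypothetical PyTorch sketch of a Mimic-style score. It scores each sample by how well its gradient step would move the new model's weights toward the reference model; the function name, signature, and the exact sign and normalization are illustrative assumptions, not the paper's implementation.

import torch

def mimic_scores(model, ref_params, inputs, targets, loss_fn):
    """Hypothetical per-sample Mimic-style score (illustrative sketch, not the paper's code)."""
    # Vector in weight space pointing from the current weights toward the reference model.
    direction = torch.cat(
        [(r - p).detach().flatten() for r, p in zip(ref_params, model.parameters())]
    )
    direction = direction / (direction.norm() + 1e-12)

    scores = []
    for x, y in zip(inputs, targets):
        # Per-sample loss and gradient with respect to the current model's parameters.
        loss = loss_fn(model(x.unsqueeze(0)), y.unsqueeze(0))
        grads = torch.autograd.grad(loss, list(model.parameters()))
        g = torch.cat([gi.flatten() for gi in grads])
        # Score the sample by how well its descent step (-g) aligns with the
        # direction toward the reference model; misaligned samples score low.
        scores.append(torch.dot(-g, direction) / (g.norm() + 1e-12))
    return torch.stack(scores)

Under this reading, Grad-Mimic would rank samples by these scores and keep the highest-scoring ones when building its filters, though the paper's exact aggregation and thresholding may differ.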