Efficient Data Selection at Scale via Influence Distillation
Mahdi Nikdan, Vincent Cohen-Addad, Dan Alistarh, Vahab Mirrokni
2025-05-29
Summary
This paper introduces Influence Distillation, a method that picks out the most useful training examples when fine-tuning large language models, making the whole process faster without sacrificing quality.
What's the problem?
Training large AI models usually involves huge amounts of data, which costs a lot of time and computing resources. Not all of that data is equally helpful, so training on everything is wasteful and slows progress.
What's the solution?
The researchers developed a way to estimate how much each training example will actually help the model learn, using gradient and curvature (second-order) information to measure its influence on a target task. Because doing this for every example would be too expensive, they compute influence exactly only for a small set of "landmark" examples and approximate it for the rest. By focusing training on the most influential data, the model learns faster and still performs well across different tasks.
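As a rough illustration of the idea (not the paper's actual implementation), the sketch below scores a pool of candidate examples by how well their gradients align with a gradient from the target task, computes that score only for a few landmark examples, and spreads it to the rest through embedding similarity. All names and the random data (`pool_embeddings`, `landmark_grads`, `target_grad`) are hypothetical placeholders, and the score is a simpler first-order stand-in for the second-order influence the paper uses.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: embeddings for a large candidate pool, plus
# per-example gradients for a small set of landmarks and an average
# gradient on the target task (all random here for illustration).
n_pool, n_landmarks, dim = 10_000, 64, 256
pool_embeddings = rng.normal(size=(n_pool, dim))        # cheap features for every candidate
landmark_idx = rng.choice(n_pool, size=n_landmarks, replace=False)
landmark_grads = rng.normal(size=(n_landmarks, dim))    # "expensive" gradients, landmarks only
target_grad = rng.normal(size=dim)                      # gradient on the target task

# Step 1: influence scores for the landmarks only:
# how well each landmark's gradient aligns with the target gradient.
landmark_influence = landmark_grads @ target_grad

# Step 2: propagate landmark influence to the full pool via
# softmax-weighted embedding similarity to the landmarks.
sim = pool_embeddings @ pool_embeddings[landmark_idx].T  # (n_pool, n_landmarks)
weights = np.exp(sim - sim.max(axis=1, keepdims=True))
weights /= weights.sum(axis=1, keepdims=True)
pool_influence = weights @ landmark_influence

# Step 3: keep the top-k most influential candidates for fine-tuning.
k = 1_000
selected = np.argsort(pool_influence)[-k:]
print(f"Selected {len(selected)} of {n_pool} candidates")
```

The key cost saving in this kind of scheme is that gradients are only needed for the landmarks, while every other candidate is scored from cheap embeddings.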
Why it matters?
This is important because it means AI can be trained more efficiently, saving time and energy while still achieving high-quality results. It makes powerful AI tools more accessible and practical for a wide range of uses.
Abstract
Influence Distillation uses second-order information to optimally select training data for LLM fine-tuning, with a landmark-based approximation that keeps selection fast, achieving competitive performance across a range of tasks.