Data Valuation using Neural Networks for Efficient Instruction Fine-Tuning
Ishika Agarwal, Dilek Hakkani-Tür
2025-02-18
Summary
This paper introduces a new method called NN-CIFT that uses small neural networks to estimate how valuable individual pieces of data are for training large language models. It's like creating a tiny helper that can quickly figure out which information is most valuable for teaching a big AI system.
What's the problem?
Current methods for measuring how important different pieces of data are for training large language models are slow and require a lot of computing power. This makes them impractical for the biggest and most advanced AI models, especially when working with large datasets.
What's the solution?
The researchers created NN-CIFT, which uses small neural networks, called InfluenceNetworks, to estimate how important each piece of data is. These small networks do the job much faster and with much less computing power than previous methods. In their tests, NN-CIFT cut the cost of this process by up to 99% while remaining just as accurate as the slower, more resource-intensive methods.
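To make the idea concrete, here is a minimal, self-contained sketch of a small regression network used to imitate expensive influence scores and then rank data for selection. Everything here is an illustrative assumption, not the paper's actual setup: the real InfluenceNetwork is trained on features derived from 7B/8B language models to mimic scores from real influence functions, whereas this toy uses synthetic features, a surrogate target, and a closed-form least-squares readout for brevity.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for per-example features; the real system derives
# features from language-model embeddings of the data (assumption).
n, dim, hidden = 512, 16, 64
X = rng.normal(size=(n, dim))
w_true = rng.normal(size=dim)
y = np.tanh(X @ w_true / 4.0)   # surrogate "influence scores" to imitate

# A tiny one-hidden-layer network: random ReLU features plus a linear
# readout fitted by least squares (the paper trains its network with
# gradient descent; this closed-form fit just keeps the sketch short).
W1 = rng.normal(scale=0.5, size=(dim, hidden))
b1 = rng.normal(scale=0.5, size=hidden)
H = np.maximum(0.0, X @ W1 + b1)              # hidden ReLU activations
Hb = np.hstack([H, np.ones((n, 1))])          # append a bias column
w2, *_ = np.linalg.lstsq(Hb, y, rcond=None)   # fit the readout weights

pred = Hb @ w2
mse = float(np.mean((pred - y) ** 2))

# Downstream use: rank examples by estimated influence, keep the top-k
# subset for instruction fine-tuning.
top_k = np.argsort(-pred)[:50]
print(f"fit MSE: {mse:.4f}; selected {len(top_k)} examples")
```

The point of the sketch is the shape of the pipeline: a network far smaller than the language model learns to reproduce influence scores, after which scoring new data costs only a cheap forward pass through the small network.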
Why it matters?
This matters because it could make it much easier and cheaper to train large AI language models. By quickly identifying which data is most important, researchers can focus on the most valuable information, potentially leading to better AI systems that can be developed more efficiently. This could speed up advancements in AI technology and make it more accessible to researchers with limited resources.
Abstract
Influence functions provide crucial insights into model training, but existing methods suffer from large computational costs and limited generalization. In particular, recent works have proposed various metrics and algorithms to calculate the influence of data using language models, which do not scale well with large models and datasets. This is because of the expensive forward and backward passes required for computation, substantial memory requirements to store large models, and poor generalization of influence estimates to new data. In this paper, we explore the use of small neural networks -- which we refer to as the InfluenceNetwork -- to estimate influence values, achieving up to 99% cost reduction. Our evaluation demonstrates that influence values can be estimated with models just 0.0027% the size of full language models (we use 7B and 8B versions). We apply our algorithm for estimating influence values (called NN-CIFT: Neural Networks for effiCient Instruction Fine-Tuning) to the downstream task of subset selection for general instruction fine-tuning. In our study, we include four state-of-the-art influence functions and show that, despite large speedups, NN-CIFT does not compromise performance relative to the original influence functions. We provide an in-depth hyperparameter analysis of NN-CIFT. The code for our method can be found here: https://github.com/agarwalishika/NN-CIFT.
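As a quick sanity check of the size claim in the abstract (a model 0.0027% the size of the 7B and 8B language models used), the implied InfluenceNetwork size works out to a few hundred thousand parameters:

```python
# Back-of-the-envelope check of the size ratio quoted in the abstract.
ratio = 0.0027 / 100          # 0.0027% expressed as a fraction
for full_params in (7_000_000_000, 8_000_000_000):
    small = full_params * ratio
    print(f"{full_params:,} -> {small:,.0f} parameters")
# roughly 189,000 and 216,000 parameters respectively
```

So the estimator is on the order of 10^5 parameters versus 10^9 for the language models it replaces for scoring, which is where the reported cost savings come from.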