MIG: Automatic Data Selection for Instruction Tuning by Maximizing Information Gain in Semantic Space

Yicheng Chen, Yining Li, Kai Hu, Zerun Ma, Haochen Ye, Kai Chen

2025-04-21


Summary

This paper introduces MIG (Maximizing Information Gain), a method for automatically selecting the most useful data for training AI models to follow instructions, with the goal of making training more effective.

What's the problem?

The problem is that when AI models are trained to follow instructions, the quality and variety of the training data are really important. If the data is too repetitive or doesn't cover enough different situations, the AI won't learn as well and might not handle new or tricky instructions.

What's the solution?

The researchers designed a method that uses a label graph to measure how much information a dataset contains, together with an efficient sampling technique, MIG, that picks the examples adding the most information. This helps them select a mix of training examples that are both diverse and high-quality, so the AI can learn from a wider range of situations and instructions.
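
To make the idea of "picking the example that adds the most information" concrete, here is a minimal Python sketch of greedy selection over a small label graph. It is an illustration under stated assumptions, not the authors' implementation: the square-root scoring function, the matrix-multiplication propagation step, the quality-weighted label vectors, and every name (`dataset_information`, `greedy_select`, `adjacency`, `samples`) are hypothetical choices made for this example.

```python
import numpy as np

def dataset_information(label_mass, adjacency):
    """Score a dataset from its per-label mass.

    Assumption: mass is spread to related labels through the graph, then
    passed through a concave function (square root), so repeatedly adding
    the same label yields diminishing returns.
    """
    propagated = adjacency @ label_mass
    return float(np.sqrt(propagated).sum())

def greedy_select(sample_vectors, adjacency, budget):
    """Pick `budget` samples, each time taking the one with the largest gain."""
    selected = []
    mass = np.zeros(adjacency.shape[0])
    remaining = set(range(len(sample_vectors)))
    for _ in range(min(budget, len(sample_vectors))):
        base = dataset_information(mass, adjacency)
        # Information gain = score with the candidate added minus current score.
        best = max(
            sorted(remaining),
            key=lambda i: dataset_information(mass + sample_vectors[i], adjacency) - base,
        )
        selected.append(best)
        mass = mass + sample_vectors[best]
        remaining.remove(best)
    return selected

# Toy label graph over three labels: math, algebra, cooking.
# Math and algebra are related (edge weight 0.5); cooking is isolated.
adjacency = np.array([
    [1.0, 0.5, 0.0],
    [0.5, 1.0, 0.0],
    [0.0, 0.0, 1.0],
])

# Each sample is its quality score placed on the labels it carries.
samples = [
    np.array([0.9, 0.0, 0.0]),  # math, quality 0.9
    np.array([0.8, 0.0, 0.0]),  # math again, quality 0.8
    np.array([0.0, 0.0, 0.6]),  # cooking, quality 0.6
    np.array([0.0, 0.7, 0.0]),  # algebra, quality 0.7
]

picked = greedy_select(samples, adjacency, budget=2)
print(picked)  # [0, 2]
```

In this toy run the highest-quality math example is chosen first. The second pick is the cooking example, even though it has the lowest quality score, because the math region of the graph is already covered and another math or algebra example would add less new information. That trade-off between quality and diversity is the behavior the paragraph above describes.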

Why does it matter?

This matters because it helps AI models get better at understanding and following all kinds of instructions, which is important for making AI more helpful, reliable, and able to handle real-world tasks.

Abstract

A unified method quantifies dataset information content using a label graph and an efficient MIG sampling technique to enhance diversity and quality in instruction-tuning datasets, outperforming existing methods.
