Heavy Labels Out! Dataset Distillation with Label Space Lightening
Ruonan Yu, Songhua Liu, Zigeng Chen, Jingwen Ye, Xinchao Wang
2024-08-16

Summary
This paper introduces HeLlO, a dataset distillation method that condenses training datasets while maintaining performance, without storing the heavy soft labels that current approaches rely on.
What's the problem?
Current methods for condensing large datasets into much smaller synthetic ones rely on storing enormous numbers of 'soft labels' (per-sample class-probability predictions produced by teacher models) to reach good performance. For large-scale datasets, these labels can take up as much storage as the original data, which makes the approach inefficient and impractical.
What's the solution?
The authors propose HeLlO, a framework that generates soft labels online, directly from the synthetic images, instead of storing them. It builds lightweight image-to-label projectors on top of open-source foundation models such as CLIP, adapts them with a LoRA-style low-rank fine-tuning strategy to bridge the gap between the pre-trained and target distributions, and additionally optimizes the synthetic images to reduce the mismatch between the original and distilled label generators. This cuts label storage to only about 0.003% of what a complete set of soft labels would require, while matching the performance of existing state-of-the-art methods; a minimal sketch of such a projector follows.
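The sketch below is a minimal, hypothetical illustration (not the authors' code) of what an online soft-label projector of this kind could look like in PyTorch: a frozen feature extractor standing in for a CLIP image encoder, with a LoRA-style low-rank update on a small classification head. Names such as `LoRALinear`, `SoftLabelProjector`, `feat_dim`, and `num_classes` are assumptions made for illustration.

```python
# Illustrative sketch of a LoRA-adapted image-to-label projector; not the paper's implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update: W + (alpha / r) * B @ A."""
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False            # pre-trained weights stay fixed
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + F.linear(x, self.scale * (self.B @ self.A))

class SoftLabelProjector(nn.Module):
    """Maps images to class logits; the softmax of these logits serves as online soft labels."""
    def __init__(self, image_encoder: nn.Module, feat_dim: int, num_classes: int, rank: int = 8):
        super().__init__()
        self.encoder = image_encoder           # e.g. a frozen CLIP image encoder
        for p in self.encoder.parameters():
            p.requires_grad = False
        self.head = LoRALinear(nn.Linear(feat_dim, num_classes), rank=rank)

    def forward(self, images: torch.Tensor) -> torch.Tensor:
        feats = self.encoder(images)
        return self.head(feats)                # logits; apply softmax to obtain soft labels
```

Under this setup, only the low-rank matrices and the small head need to be kept alongside the distilled images, which is where the storage savings over a full per-sample table of soft labels would come from.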
Why it matters?
This research matters because it makes training machine learning models on distilled data cheaper and more practical by sharply reducing the storage that condensed datasets require. Lighter condensed datasets let researchers and developers work more efficiently, especially when dealing with large-scale data.
Abstract
Dataset distillation or condensation aims to condense a large-scale training dataset into a much smaller synthetic one such that the training performance of the distilled and original sets on neural networks is similar. Although the number of training samples can be reduced substantially, current state-of-the-art methods heavily rely on enormous soft labels to achieve satisfactory performance. As a result, the required storage can be comparable even to original datasets, especially for large-scale ones. To solve this problem, instead of storing these heavy labels, we propose a novel label-lightening framework termed HeLlO, aiming at effective image-to-label projectors, with which synthetic labels can be directly generated online from synthetic images. Specifically, to construct such projectors, we leverage prior knowledge in open-source foundation models, e.g., CLIP, and introduce a LoRA-like fine-tuning strategy to mitigate the gap between pre-trained and target distributions, so that original models for soft-label generation can be distilled into a group of low-rank matrices. Moreover, an effective image optimization method is proposed to further mitigate the potential error between the original and distilled label generators. Extensive experiments demonstrate that with only about 0.003% of the original storage required for a complete set of soft labels, we achieve comparable performance to current state-of-the-art dataset distillation methods on large-scale datasets. Our code will be available.
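As a rough illustration of the remaining two ingredients described in the abstract, the following hypothetical sketch (assuming a PyTorch setup, with `teacher` standing in for the original soft-label generator and `projector` for the distilled one) shows an image-refinement step that reduces the mismatch between the two label generators, followed by student training in which soft labels are regenerated online rather than loaded from storage. The exact objectives and hyperparameters are assumptions, not the paper's formulation.

```python
# Hypothetical sketch of image refinement and online-label training; not the paper's exact method.
import torch
import torch.nn.functional as F

def refine_images(images, projector, teacher, steps=50, lr=0.01):
    """Nudge the synthetic images so the lightweight projector agrees with the original teacher."""
    images = images.clone().requires_grad_(True)
    opt = torch.optim.Adam([images], lr=lr)
    for _ in range(steps):
        with torch.no_grad():
            target = teacher(images).softmax(dim=-1)            # reference soft labels
        log_pred = F.log_softmax(projector(images), dim=-1)      # distilled label generator
        loss = F.kl_div(log_pred, target, reduction="batchmean")
        opt.zero_grad()
        loss.backward()                                          # gradients flow into the images only
        opt.step()
    return images.detach()

def train_student(student, images, projector, epochs=100, lr=0.1):
    """Train a student network on the distilled images; soft labels are produced on the fly."""
    opt = torch.optim.SGD(student.parameters(), lr=lr, momentum=0.9)
    for _ in range(epochs):
        with torch.no_grad():
            soft = projector(images).softmax(dim=-1)             # no stored soft labels needed
        log_student = F.log_softmax(student(images), dim=-1)
        loss = F.kl_div(log_student, soft, reduction="batchmean")
        opt.zero_grad()
        loss.backward()
        opt.step()
```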