
Synthio: Augmenting Small-Scale Audio Classification Datasets with Synthetic Data

Sreyan Ghosh, Sonal Kumar, Zhifeng Kong, Rafael Valle, Bryan Catanzaro, Dinesh Manocha

2024-10-04


Summary

This paper presents Synthio, a method for augmenting small-scale audio classification datasets with synthetic audio generated by text-to-audio models, improving the accuracy of classifiers trained on limited labeled data.

What's the problem?

Many audio classification tasks suffer from having too little labeled data, which makes it hard for models to learn effectively. Traditional augmentation methods, such as adding random noise or masking segments of the audio, often fail to create realistic and diverse examples that represent the variety of sounds found in the real world. This limits how well models generalize to new audio data.
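The traditional augmentations criticized here are simple signal-level transforms. As a minimal sketch (not from the paper, just illustrating the baseline techniques), noise injection and time masking might look like:

```python
import numpy as np

def add_noise(waveform, snr_db=20.0, rng=None):
    """Add white Gaussian noise at a target signal-to-noise ratio (in dB)."""
    rng = rng or np.random.default_rng(0)
    signal_power = np.mean(waveform ** 2)
    noise_power = signal_power / (10 ** (snr_db / 10))
    noise = rng.normal(0.0, np.sqrt(noise_power), size=waveform.shape)
    return waveform + noise

def time_mask(waveform, mask_fraction=0.1, rng=None):
    """Zero out a random contiguous segment (SpecAugment-style masking, in time)."""
    rng = rng or np.random.default_rng(0)
    n = len(waveform)
    mask_len = int(n * mask_fraction)
    start = rng.integers(0, n - mask_len)
    out = waveform.copy()
    out[start:start + mask_len] = 0.0
    return out

# Example: augment a 1-second 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.linspace(0, 1, sr, endpoint=False)
audio = np.sin(2 * np.pi * 440 * t)
noisy = add_noise(audio)
masked = time_mask(audio)
```

Both transforms preserve the label but only perturb the existing signal, which is why they cannot add the compositional diversity (new acoustic scenes, sources, or contexts) that Synthio targets with generative models.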

What's the solution?

To solve this problem, the authors developed Synthio, which augments small datasets with synthetic audio generated by text-to-audio (T2A) diffusion models. Two challenges arise: the synthetic audio must be acoustically consistent with the original dataset, and it must be compositionally diverse enough to help the model learn. Synthio addresses the first by aligning the T2A model's generations with real audio from the dataset using preference optimization, and the second with a caption generation technique that uses a Large Language Model to produce diverse, meaningful prompts and iteratively refine them. The authors tested Synthio on ten different datasets and found that it consistently improved classification accuracy compared to other augmentation methods.
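The overall loop can be sketched at a high level: propose captions per class, generate audio with the aligned T2A model, filter out generations that do not match the label, and mix the survivors with the real data. Everything below is a hypothetical interface, not the paper's actual code; `generate_captions`, `t2a_generate`, and `accepts` stand in for the LLM caption proposer, the preference-aligned diffusion model, and the filtering step:

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Example:
    audio: list   # placeholder for a waveform
    label: str

def synthio_augment(
    dataset: List[Example],
    generate_captions: Callable[[str], List[str]],  # LLM caption proposer (hypothetical)
    t2a_generate: Callable[[str], list],            # preference-aligned T2A model (hypothetical)
    accepts: Callable[[list, str], bool],           # keep only label-consistent generations
    per_class: int = 2,
) -> List[Example]:
    """Sketch of the Synthio loop: caption -> generate -> filter -> mix with real data."""
    synthetic = []
    labels = {ex.label for ex in dataset}
    for label in sorted(labels):
        kept = 0
        for caption in generate_captions(label):
            if kept >= per_class:
                break
            audio = t2a_generate(caption)
            if accepts(audio, label):
                synthetic.append(Example(audio=audio, label=label))
                kept += 1
    return dataset + synthetic

# Usage with trivial stubs standing in for the real models
real = [Example(audio=[0.0] * 8, label="dog_bark")]
aug = synthio_augment(
    real,
    generate_captions=lambda lbl: [f"a recording of {lbl}", f"distant {lbl} outdoors"],
    t2a_generate=lambda cap: [0.1] * 8,
    accepts=lambda audio, lbl: True,
)
# aug contains the 1 real example plus 2 synthetic ones
```

The design point is that diversity comes from the caption space (varied LLM prompts per class), while acoustic consistency is enforced separately by the preference-optimized generator and the filtering step.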

Why it matters?

This research is important because it shows how synthetic data can be effectively used to enhance small-scale audio classification tasks. By improving the quality and diversity of training data, Synthio can help create more accurate and reliable audio classification models, which can be applied in various fields such as music recognition, speech analysis, and environmental sound classification.

Abstract

We present Synthio, a novel approach for augmenting small-scale audio classification datasets with synthetic data. Our goal is to improve audio classification accuracy with limited labeled data. Traditional data augmentation techniques, which apply artificial transformations (e.g., adding random noise or masking segments), struggle to create data that captures the true diversity present in real-world audios. To address this shortcoming, we propose to augment the dataset with synthetic audio generated from text-to-audio (T2A) diffusion models. However, synthesizing effective augmentations is challenging because not only should the generated data be acoustically consistent with the underlying small-scale dataset, but they should also have sufficient compositional diversity. To overcome the first challenge, we align the generations of the T2A model with the small-scale dataset using preference optimization. This ensures that the acoustic characteristics of the generated data remain consistent with the small-scale dataset. To address the second challenge, we propose a novel caption generation technique that leverages the reasoning capabilities of Large Language Models to (1) generate diverse and meaningful audio captions and (2) iteratively refine their quality. The generated captions are then used to prompt the aligned T2A model. We extensively evaluate Synthio on ten datasets and four simulated limited-data settings. Results indicate our method consistently outperforms all baselines by 0.1%-39% using a T2A model trained only on weakly-captioned AudioSet.