
SoloAudio: Target Sound Extraction with Language-oriented Audio Diffusion Transformer

Helin Wang, Jiarui Hai, Yen-Ju Lu, Karan Thakkar, Mounya Elhilali, Najim Dehak

2024-09-19


Summary

This paper introduces SoloAudio, a diffusion-based model that extracts a specific target sound from an audio mixture, guided by either a text description or an audio example of that sound.

What's the problem?

Extracting a particular sound from a mixture is challenging, especially when the target sound overlaps with background noise and other sound events. Traditional methods often struggle to isolate such sounds cleanly, which limits applications like music editing and soundscape analysis.

What's the solution?

The researchers developed SoloAudio, a diffusion-based generative model that can be guided by either an audio example or a text description of the target sound. Instead of the U-Net backbone used in earlier diffusion models, it uses a skip-connected Transformer that operates on latent audio features, and it extracts the target-sound embedding with a CLAP model, which maps audio and text into a shared space. Thanks to that shared space, SoloAudio works in 'zero-shot' mode (extracting sound types it never saw during training, described only by text) and in 'few-shot' mode (adapting from just a few audio examples). The model was also trained on synthetic mixtures generated by state-of-the-art text-to-audio models, which helped it generalize to new and unseen sounds. A rough sketch of the conditioning idea is shown below.
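To make the conditioning idea concrete, here is a minimal, hypothetical PyTorch sketch (not the authors' code): a toy denoiser that concatenates a CLAP-like target-sound embedding onto each latent frame. All class names, dimensions, and the fusion scheme are illustrative assumptions, not SoloAudio's actual architecture.

```python
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    """Toy diffusion denoiser conditioned on a target-sound embedding.

    Hypothetical sketch: the real SoloAudio conditions a latent diffusion
    Transformer on CLAP embeddings; here we simply concatenate a conditioning
    vector onto every latent frame to show the general idea.
    """

    def __init__(self, latent_dim=64, cond_dim=512, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + cond_dim + 1, hidden),  # +1 for the timestep
            nn.SiLU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, noisy_latents, t, cond):
        # noisy_latents: (batch, frames, latent_dim) latents of the mixture
        # t: (batch,) diffusion timestep
        # cond: (batch, cond_dim) CLAP-like embedding of the target sound
        b, f, _ = noisy_latents.shape
        t_feat = t.float().view(b, 1, 1).expand(b, f, 1)
        c_feat = cond.unsqueeze(1).expand(b, f, cond.size(-1))
        x = torch.cat([noisy_latents, c_feat, t_feat], dim=-1)
        return self.net(x)  # predicted noise (or clean target latents)

# Zero-shot use: cond would come from a text encoder ("dog barking");
# few-shot / audio-oriented use: cond would come from enrollment audio clips.
model = ConditionedDenoiser()
latents = torch.randn(2, 100, 64)   # noisy mixture latents
t = torch.randint(0, 1000, (2,))    # diffusion timesteps
cond = torch.randn(2, 512)          # stand-in for a CLAP embedding
eps_pred = model(latents, t, cond)
print(eps_pred.shape)  # torch.Size([2, 100, 64])
```

Because text and audio land in the same CLAP embedding space, the same denoiser can be steered by either modality without retraining, which is what enables the zero-shot and few-shot behavior described above.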

Why it matters?

This research is significant because it improves the ability to isolate specific sounds in complex audio environments. This capability is useful for various applications, such as creating audiobooks, enhancing music production, and analyzing soundscapes. By making it easier to extract and manipulate sounds, SoloAudio opens up new possibilities for audio technology and creativity.

Abstract

In this paper, we introduce SoloAudio, a novel diffusion-based generative model for target sound extraction (TSE). Our approach trains latent diffusion models on audio, replacing the previous U-Net backbone with a skip-connected Transformer that operates on latent features. SoloAudio supports both audio-oriented and language-oriented TSE by utilizing a CLAP model as the feature extractor for target sounds. Furthermore, SoloAudio leverages synthetic audio generated by state-of-the-art text-to-audio models for training, demonstrating strong generalization to out-of-domain data and unseen sound events. We evaluate this approach on the FSD Kaggle 2018 mixture dataset and real data from AudioSet, where SoloAudio achieves state-of-the-art results on both in-domain and out-of-domain data, and exhibits impressive zero-shot and few-shot capabilities. Source code and demos are released.
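As a rough illustration of the "skip-connected Transformer" backbone the abstract mentions, the following PyTorch sketch wires U-Net-style long skip connections between Transformer layers. The depth, fusion via a linear projection, and all dimensions are assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn

class SkipTransformer(nn.Module):
    """Minimal sketch of a skip-connected Transformer over latent features.

    Assumption-based: mirrors the U-Net idea of long skip connections,
    but between Transformer layers rather than convolutional stages.
    """

    def __init__(self, dim=64, depth=6, heads=4):
        super().__init__()
        assert depth % 2 == 0
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
            for _ in range(depth)
        )
        # One projection per second-half layer to fuse its mirrored skip.
        self.skip_proj = nn.ModuleList(
            nn.Linear(2 * dim, dim) for _ in range(depth // 2)
        )

    def forward(self, x):
        # x: (batch, frames, dim) latent features
        half = len(self.layers) // 2
        skips = []
        for layer in self.layers[:half]:
            x = layer(x)
            skips.append(x)  # save first-half outputs as skips
        for i, layer in enumerate(self.layers[half:]):
            # Fuse with the mirrored skip (last saved pairs with first decoder layer).
            x = self.skip_proj[i](torch.cat([x, skips.pop()], dim=-1))
            x = layer(x)
        return x

backbone = SkipTransformer()
out = backbone(torch.randn(2, 100, 64))
print(out.shape)  # torch.Size([2, 100, 64])
```

The long skips let later layers see early, less-processed features, which is the property the U-Net backbone provided in prior diffusion TSE models.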