
ZeroSep: Separate Anything in Audio with Zero Training

Chao Huang, Yuesheng Ma, Junxuan Huang, Susan Liang, Yunlong Tang, Jing Bi, Wenqiang Liu, Nima Mesgarani, Chenliang Xu

2025-05-30


Summary

This paper introduces ZeroSep, a new AI tool that can pull apart different sounds from a single audio recording, such as separating vocals from background music, using nothing more than a text description of the sound you want. It does this without being trained on examples for each specific separation task.

What's the problem?

Most systems for separating sounds in audio need extensive training on labeled examples for every type of sound they are expected to split apart. Collecting that data is time-consuming, and the resulting models struggle with kinds of sounds they have never encountered during training.

What's the solution?

The researchers built ZeroSep on top of a text-guided audio diffusion model, a type of AI model that is already pre-trained on large amounts of audio and can follow instructions written in plain text. Because the model already understands both sound and language, it can pull out whatever sound you describe in a prompt, even if it has never seen that exact situation before (a rough sketch of the idea appears below).
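Below is a minimal, hypothetical sketch of the general idea behind this kind of zero-shot separation: the mixed audio is mapped into the noise space of a pre-trained text-conditioned diffusion model (DDIM inversion), then denoised again while the model is conditioned on a prompt describing the sound to keep. Everything here (the ToyTextAudioDiffusion class, the toy noise schedule, the random placeholder embeddings) is a stand-in invented for illustration; it is not the authors' code, and the paper's exact procedure may differ.

```python
# Illustrative only: invert a mixture into noise with a pre-trained
# text-conditioned diffusion model, then denoise it again guided by a
# text prompt naming the target source. All names below are hypothetical.

import torch
import torch.nn as nn

class ToyTextAudioDiffusion(nn.Module):
    """Stand-in for a pre-trained text-conditioned noise predictor."""
    def __init__(self, dim=256):
        super().__init__()
        self.net = nn.Linear(dim * 2, dim)

    def forward(self, x, t, text_emb):
        # Predicts the noise present in x at step t, given text conditioning.
        return self.net(torch.cat([x, text_emb], dim=-1))

def ddim_invert(model, x0, text_emb, alphas):
    """Run the DDIM update in reverse: map a clean latent toward noise."""
    x = x0
    for t in range(len(alphas) - 1):
        eps = model(x, t, text_emb)
        a_t, a_next = alphas[t], alphas[t + 1]
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x

def ddim_sample(model, xT, text_emb, alphas):
    """Denoise from xT back to a clean latent, guided by the text prompt."""
    x = xT
    for t in reversed(range(1, len(alphas))):
        eps = model(x, t, text_emb)
        a_t, a_prev = alphas[t], alphas[t - 1]
        x0_pred = (x - (1 - a_t).sqrt() * eps) / a_t.sqrt()
        x = a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps
    return x

model = ToyTextAudioDiffusion()
alphas = torch.linspace(0.999, 0.01, 50)   # toy noise schedule
mixture_latent = torch.randn(1, 256)       # placeholder latent of the mixed audio
prompt_emb = torch.randn(1, 256)           # placeholder embedding of e.g. "a person singing"

noise = ddim_invert(model, mixture_latent, prompt_emb, alphas)
separated_latent = ddim_sample(model, noise, prompt_emb, alphas)
```

The point the sketch tries to convey is that no separation-specific training happens anywhere: the only task-specific input is the text prompt.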

Why does it matter?

This is important because it makes audio editing and analysis much easier and more flexible, helping musicians, podcasters, and anyone working with sound to isolate and work with specific audio parts without needing tons of training data or technical skills.

Abstract

ZeroSep achieves zero-shot audio source separation by conditioning a pre-trained text-guided audio diffusion model on text prompts, outperforming supervised methods on various benchmarks.