SAM Audio: Segment Anything in Audio

Bowen Shi, Andros Tjandra, John Hoffman, Helin Wang, Yi-Chiao Wu, Luya Gao, Julius Richter, Matt Le, Apoorv Vyas, Sanyuan Chen, Christoph Feichtenhofer, Piotr Dollár, Wei-Ning Hsu, Ann Lee

2025-12-24

Summary

This paper introduces SAM Audio, a new AI model designed to pull apart different sounds from a single recording, like separating a singer's voice from the instruments in a song or isolating a specific sound effect.

What's the problem?

Current audio separation models are often limited because they're built for specific types of sound, like *only* music or *only* speech. They also usually accept just one kind of instruction: you might be able to ask for the vocals with a text prompt, but not by pointing to a time span in the recording or by highlighting the source you want in an accompanying visual. This makes it hard to build a truly versatile system that can handle any sound and respond to different kinds of requests.

What's the solution?

The researchers created SAM Audio, which uses a powerful architecture called a diffusion transformer, trained with a technique called flow matching on a huge amount of audio data spanning speech, music, and everyday sounds. This lets SAM Audio separate a target sound based on a text description, a visual mask that highlights the source you want, or a time span marking where the sound occurs. It combines all of these prompting methods into one system, as sketched below.
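The practical consequence is that one model accepts three different kinds of prompts through a single entry point. Below is a minimal sketch of what such a unified prompting interface could look like; the names `SAMAudioSeparator`, `SeparationPrompt`, and `separate` are hypothetical placeholders for illustration, not the API released with the paper.

```python
# Hypothetical sketch of a unified prompting interface for audio separation.
# The class and method names are illustrative only, not the released API.

from dataclasses import dataclass
from typing import Optional, Tuple

import numpy as np


@dataclass
class SeparationPrompt:
    """Exactly one of the three prompt types the paper describes."""
    text: Optional[str] = None                        # e.g. "the singer's voice"
    visual_mask: Optional[np.ndarray] = None          # mask highlighting the target source
    time_span: Optional[Tuple[float, float]] = None   # (start_sec, end_sec)


class SAMAudioSeparator:
    """Toy stand-in for the model; a real system would run a prompted diffusion transformer."""

    def separate(self, mixture: np.ndarray, sample_rate: int, prompt: SeparationPrompt) -> np.ndarray:
        # Placeholder: a real implementation would condition on the prompt and
        # generate only the target source. Here we simply echo the mixture back.
        return mixture


# All three prompt styles go through the same entry point:
separator = SAMAudioSeparator()
mixture = np.zeros(16_000, dtype=np.float32)  # one second of dummy audio at 16 kHz
vocals = separator.separate(mixture, 16_000, SeparationPrompt(text="the singer's voice"))
snippet = separator.separate(mixture, 16_000, SeparationPrompt(time_span=(2.0, 4.5)))
```

The design point this illustrates is that the prompt type is just another conditioning input, so adding a new prompting modality does not require a separate model.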

Why it matters?

SAM Audio is a big step forward because it's a general-purpose audio separation model that performs better than existing systems across many different types of audio. It's more flexible and can understand a wider range of instructions, making it useful for building more advanced AI systems that can truly understand and interact with the world through sound. They also created a new way to test these kinds of models that better reflects how people actually want to use them.

Abstract

General audio source separation is a key capability for multimodal AI systems that can perceive and reason about sound. Despite substantial progress in recent years, existing separation models are either domain-specific, designed for fixed categories such as speech or music, or limited in controllability, supporting only a single prompting modality such as text. In this work, we present SAM Audio, a foundation model for general audio separation that unifies text, visual, and temporal span prompting within a single framework. Built on a diffusion transformer architecture, SAM Audio is trained with flow matching on large-scale audio data spanning speech, music, and general sounds, and can flexibly separate target sources described by language, visual masks, or temporal spans. The model achieves state-of-the-art performance across a diverse suite of benchmarks, including general sound, speech, music, and musical instrument separation in both in-the-wild and professionally produced audios, substantially outperforming prior general-purpose and specialized systems. Furthermore, we introduce a new real-world separation benchmark with human-labeled multimodal prompts and a reference-free evaluation model that correlates strongly with human judgment.
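For context on the training objective named in the abstract: flow matching, in its standard conditional form with a linear noise-to-data path, reduces to regressing a velocity field. The sketch below shows that generic training step; the model interface, tensor shapes, and conditioning format are assumptions for illustration, not SAM Audio's actual implementation.

```python
# Generic conditional flow-matching training step (linear interpolation path).
# This illustrates the technique named in the abstract; it is not SAM Audio's code.

import torch
import torch.nn as nn


def flow_matching_step(model: nn.Module, target: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
    """One training step: regress the velocity along a straight path from noise to the target source."""
    batch = target.shape[0]
    noise = torch.randn_like(target)                       # x_0 ~ N(0, I)
    t = torch.rand(batch, device=target.device)            # t ~ U[0, 1]
    t_exp = t.view(batch, *([1] * (target.dim() - 1)))     # broadcast t over remaining dims
    x_t = (1.0 - t_exp) * noise + t_exp * target           # point on the straight noise-to-data path
    velocity_target = target - noise                       # derivative of that path w.r.t. t
    velocity_pred = model(x_t, t, cond)                    # prompt-conditioned diffusion transformer
    return ((velocity_pred - velocity_target) ** 2).mean() # mean squared error on the velocity
```

Here `cond` stands in for whatever prompt embedding (text, visual mask, or temporal span) conditions the network, which is what lets a single objective cover all three prompting modalities.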