Facing the Music: Tackling Singing Voice Separation in Cinematic Audio Source Separation
Karn N. Watcharasupat, Chih-Wei Wu, Iroro Orife
2024-08-08

Summary
This paper presents a new approach to separating singing voices from the other audio elements in film soundtracks, treating the singing voice as its own stem rather than folding it into dialogue or music.
What's the problem?
In cinematic audio, sounds are mixed together, making it difficult to separate elements such as dialogue, music, and sound effects. A particularly common challenge is the singing voice, which may belong to either the dialogue or the music stem depending on the cinematic context. This ambiguity makes it hard to accurately isolate each sound for editing or analysis.
What's the solution?
The authors expand the traditional three stems (dialogue, music, and effects) into four by treating the singing voice as its own stem, separate from non-musical dialogue, instrumental music, and effects. They extend two existing models to this four-stem setup: Bandit, which uses a dedicated decoder for each stem, and Banquet, which uses a single decoder steered by a query specifying the desired stem. The query-based model turned out to extract the singing voice more precisely, which the authors attribute to better feature alignment at the model's bottleneck.
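To make the query-based idea concrete, below is a minimal, self-contained sketch (not the authors' implementation; the module names, layer sizes, and overall architecture are illustrative assumptions) of how a single-decoder separator can be steered toward one of the four stems using a learned query embedding and a FiLM (feature-wise linear modulation) layer at the bottleneck:

```python
# Illustrative sketch only: a toy query-conditioned separator.
# A learned embedding per stem modulates bottleneck features via FiLM,
# so one shared decoder can serve all four stems.
import torch
import torch.nn as nn

STEMS = ["dialogue", "instrumental_music", "singing_voice", "effects"]

class FiLM(nn.Module):
    """Produces per-channel scale and shift from a stem-query embedding."""
    def __init__(self, query_dim: int, num_channels: int):
        super().__init__()
        self.to_scale = nn.Linear(query_dim, num_channels)
        self.to_shift = nn.Linear(query_dim, num_channels)

    def forward(self, features: torch.Tensor, query: torch.Tensor) -> torch.Tensor:
        # features: (batch, channels, time); query: (batch, query_dim)
        scale = self.to_scale(query).unsqueeze(-1)
        shift = self.to_shift(query).unsqueeze(-1)
        return scale * features + shift

class QueryConditionedSeparator(nn.Module):
    """Single encoder/decoder; the requested stem is selected by its query."""
    def __init__(self, channels: int = 64, query_dim: int = 32):
        super().__init__()
        self.stem_queries = nn.Embedding(len(STEMS), query_dim)
        self.encoder = nn.Conv1d(1, channels, kernel_size=7, padding=3)
        self.film = FiLM(query_dim, channels)
        self.decoder = nn.Conv1d(channels, 1, kernel_size=7, padding=3)

    def forward(self, mixture: torch.Tensor, stem_idx: torch.Tensor) -> torch.Tensor:
        # mixture: (batch, 1, samples); stem_idx: (batch,) index into STEMS
        latent = self.encoder(mixture)
        latent = self.film(latent, self.stem_queries(stem_idx))
        return self.decoder(latent)

# Example: extract the singing-voice stem from one second of mono audio.
model = QueryConditionedSeparator()
mix = torch.randn(1, 1, 44100)
singing = model(mix, torch.tensor([STEMS.index("singing_voice")]))
print(singing.shape)  # torch.Size([1, 1, 44100])
```

In the actual Banquet model the conditioning sits inside a band-split architecture with a band-agnostic FiLM layer; the sketch above only illustrates the general query-then-modulate pattern.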
Why does it matter?
This research addresses a significant gap in audio processing for film and media. By improving the separation of singing voices from other sounds, it can streamline post-production workflows in filmmaking and improve audio quality in downstream applications, leading to better viewer experiences.
Abstract
Cinematic audio source separation (CASS) is a fairly new subtask of audio source separation. A typical setup of CASS is a three-stem problem, with the aim of separating the mixture into the dialogue stem (DX), music stem (MX), and effects stem (FX). In practice, however, several edge cases exist as some sound sources do not fit neatly into any of these three stems, necessitating the use of additional auxiliary stems in production. One very common edge case is the singing voice in film audio, which may belong in either the DX or the MX, depending heavily on the cinematic context. In this work, we demonstrate a very straightforward extension of the dedicated-decoder Bandit and query-based single-decoder Banquet models to a four-stem problem, treating non-musical dialogue, instrumental music, singing voice, and effects as separate stems. Interestingly, the query-based Banquet model outperformed the dedicated-decoder Bandit model. We hypothesize that this is due to better feature alignment at the bottleneck, as enforced by the band-agnostic FiLM layer. The dataset and model implementation will be made available at https://github.com/kwatcharasupat/source-separation-landing.
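For readers less familiar with the stem formulation, the standard additive mixture assumption (implicit in the abstract; the notation below is illustrative and not taken from the paper) makes the change from three to four stems explicit:

```latex
\begin{align}
  x[n] &= s_{\mathrm{DX}}[n] + s_{\mathrm{MX}}[n] + s_{\mathrm{FX}}[n]
        && \text{(three-stem CASS)} \\
  x[n] &= s_{\mathrm{speech}}[n] + s_{\mathrm{instr}}[n] + s_{\mathrm{sing}}[n] + s_{\mathrm{FX}}[n]
        && \text{(four-stem extension)}
\end{align}
```

Here x[n] is the film mixture and each s[n] is one stem's contribution; the DX and MX stems are split so that the singing voice s_sing becomes its own separation target.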