Audio Conditioning for Music Generation via Discrete Bottleneck Features

Simon Rouard, Yossi Adi, Jade Copet, Axel Roebel, Alexandre Défossez

2024-07-18

Summary

This paper presents a method for generating music conditioned on both audio input and text descriptions, allowing for more flexible and creative music creation.

What's the problem?

Most music generation systems rely on text prompts or a few high-level parameters, such as tempo and genre, to create music. These inputs are limiting because they ignore actual audio, which can provide richer context and inspiration for the music being generated. As a result, the generated music might not fully capture the desired sound or style.

What's the solution?

The authors propose a method that combines audio conditioning with traditional text-based approaches, using two main strategies. The first, textual inversion, maps an audio input to 'pseudowords' in the text embedding space of a pre-trained text-to-music model, so the model can treat the audio as if it were part of a text prompt. The second trains a new music generation model from scratch that works with both text and quantized audio features. At generation time, the system can mix and balance information from both sources, using a novel double classifier-free guidance technique to control how strongly each condition steers the output.
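A rough way to picture double classifier-free guidance is as two nested guidance steps over the model's next-token logits: one contrasts text-plus-audio conditioning against text-only conditioning, and the other contrasts that result against the unconditional prediction. The sketch below is a minimal, hypothetical illustration of that idea in PyTorch; the exact combination rule, the function name, and the guidance weights `alpha` and `beta` are assumptions, not the paper's released implementation.

```python
import torch

def double_cfg_logits(
    logits_uncond: torch.Tensor,      # logits with no conditioning
    logits_text: torch.Tensor,        # logits conditioned on text only
    logits_text_audio: torch.Tensor,  # logits conditioned on text + audio
    alpha: float = 3.0,               # overall guidance strength (illustrative value)
    beta: float = 2.0,                # extra weight on the audio contribution (illustrative value)
) -> torch.Tensor:
    # First guidance step: push the text+audio prediction away from the
    # text-only one, amplifying what the audio condition adds on top of the text.
    inner = logits_text + beta * (logits_text_audio - logits_text)
    # Second guidance step: push the result away from the unconditional
    # prediction, as in standard classifier-free guidance.
    return logits_uncond + alpha * (inner - logits_uncond)


# Example: combine logits for one decoding step and sample the next audio token.
vocab_size = 2048
u, t, ta = (torch.randn(vocab_size) for _ in range(3))
probs = torch.softmax(double_cfg_logits(u, t, ta), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```

In a scheme like this, `alpha` would control overall adherence to the conditions while `beta` would tilt the balance toward the audio reference, which is the kind of mixing the paper describes.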

Why it matters?

This research is significant because it opens up new possibilities for music generation by allowing models to learn from actual audio, not just text. This could lead to more innovative and personalized music creations, making it useful for musicians, composers, and anyone interested in generating unique sounds. By improving how machines understand and create music, this work could enhance various applications in entertainment, gaming, and media production.

Abstract

While most music generation models use textual or parametric conditioning (e.g. tempo, harmony, musical genre), we propose to condition a language-model-based music generation system with audio input. Our exploration involves two distinct strategies. The first strategy, termed textual inversion, leverages a pre-trained text-to-music model to map audio input to corresponding "pseudowords" in the textual embedding space. For the second model we train a music language model from scratch jointly with a text conditioner and a quantized audio feature extractor. At inference time, we can mix textual and audio conditioning and balance them thanks to a novel double classifier-free guidance method. We conduct automatic and human studies that validate our approach. We will release the code, and we provide music samples on https://musicgenstyle.github.io in order to show the quality of our model.
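To make the abstract's first strategy more concrete, here is a minimal sketch of how textual inversion could learn a pseudoword: a new embedding is optimized so that a frozen pre-trained text-to-music model assigns high likelihood to the discrete audio tokens of a reference clip. The `frozen_model` interface, the embedding size, and the optimizer settings are hypothetical stand-ins, not the paper's actual API.

```python
import torch

def learn_pseudoword(frozen_model, reference_codes, embed_dim=1024, steps=500, lr=1e-2):
    """Optimize a single pseudoword embedding so that the frozen text-to-music
    model reconstructs the audio token sequence of a reference track.
    `frozen_model(cond_embeddings=..., targets=...)` returning a scalar
    cross-entropy loss is an assumed interface for illustration only."""
    pseudoword = torch.randn(1, embed_dim, requires_grad=True)
    optimizer = torch.optim.Adam([pseudoword], lr=lr)
    for _ in range(steps):
        optimizer.zero_grad()
        # Only the pseudoword embedding receives gradients; the weights of the
        # pre-trained text-to-music model stay frozen throughout.
        loss = frozen_model(cond_embeddings=pseudoword, targets=reference_codes)
        loss.backward()
        optimizer.step()
    return pseudoword.detach()
```

The learned embedding could then be spliced into an ordinary text prompt, letting the reference audio act as a reusable "word" when generating new music.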