Stable Audio Open

Zach Evans, Julian D. Parker, CJ Carr, Zack Zukowski, Josiah Taylor, Jordi Pons

2024-07-22

Summary

This paper introduces Stable Audio Open, a new text-to-audio model that is publicly available for artists and researchers. It describes how the model was built and trained on Creative Commons data, and it shows that the model is competitive with state-of-the-art systems.

What's the problem?

Many current text-to-audio models are private and not accessible to the public, which limits the ability of artists and researchers to use these tools for their projects. This lack of access makes it hard for people to build upon existing models or improve them, stifling innovation in the field of audio generation.

What's the solution?

The authors developed Stable Audio Open as an open-weights text-to-audio model. They describe its architecture and training process in detail, highlighting how it was trained on Creative Commons data so that it could be released openly. Their evaluation shows that the model is competitive with state-of-the-art systems across several metrics, and its FDopenl3 scores (a measure of how realistic the generations sound) highlight its ability to synthesize high-quality stereo audio at 44.1kHz.

Why it matters?

This research is important because it provides a valuable resource for the community by making a high-quality audio generation model available to everyone. By allowing artists and researchers to access and build on this model, it encourages creativity and innovation in audio production, potentially leading to new applications in music, sound design, and other creative fields.

Abstract

Open generative models are vitally important for the community, allowing for fine-tunes and serving as baselines when presenting new models. However, most current text-to-audio models are private and not accessible for artists and researchers to build upon. Here we describe the architecture and training process of a new open-weights text-to-audio model trained with Creative Commons data. Our evaluation shows that the model's performance is competitive with the state-of-the-art across various metrics. Notably, the reported FDopenl3 results (measuring the realism of the generations) showcase its potential for high-quality stereo sound synthesis at 44.1kHz.
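The FDopenl3 metric mentioned above is a Fréchet distance computed between Openl3 embeddings of real and generated audio. As a minimal NumPy sketch of the Fréchet distance itself (the Openl3 embedding step is omitted, and the function names here are illustrative rather than taken from the paper's codebase):

```python
import numpy as np

def _sqrtm_psd(m):
    """Matrix square root of a symmetric positive semi-definite matrix."""
    vals, vecs = np.linalg.eigh(m)
    vals = np.clip(vals, 0.0, None)  # guard against tiny negative eigenvalues
    return vecs @ np.diag(np.sqrt(vals)) @ vecs.T

def frechet_distance(emb_a, emb_b):
    """Fréchet distance between two sets of embeddings, each (n_samples, dim).

    Models each set as a Gaussian and computes
    ||mu_a - mu_b||^2 + Tr(S_a + S_b - 2 (S_a S_b)^{1/2}).
    """
    mu_a, mu_b = emb_a.mean(axis=0), emb_b.mean(axis=0)
    cov_a = np.cov(emb_a, rowvar=False)
    cov_b = np.cov(emb_b, rowvar=False)
    # Tr((S_a S_b)^{1/2}) computed via the symmetric form S_a^{1/2} S_b S_a^{1/2}
    sqrt_a = _sqrtm_psd(cov_a)
    tr_covmean = np.trace(_sqrtm_psd(sqrt_a @ cov_b @ sqrt_a))
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * tr_covmean)
```

Lower scores mean the generated embeddings are statistically closer to the real ones; identical embedding sets give a distance of (numerically) zero.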