SAO-Instruct: Free-form Audio Editing using Natural Language Instructions

Michael Ungersböck, Florian Grötschla, Luca A. Lanzendörfer, June Young Yi, Changho Choi, Roger Wattenhofer

2025-10-29

Summary

This paper introduces a new model, SAO-Instruct, that can change existing audio recordings based on simple, everyday language instructions. It builds upon a powerful audio generation system called Stable Audio Open.

What's the problem?

Currently, it's really hard to edit audio using just natural language. Existing methods either need you to describe the *entire* audio clip after editing, or they only allow very specific, pre-programmed edits. This limits how easily and creatively you can modify sounds.

What's the solution?

The researchers created SAO-Instruct by training a model on a dataset of audio editing triplets: an input audio clip, an instruction describing how to change it, and the resulting output audio. They built this dataset using a combination of automatic generation techniques (Prompt-to-Prompt and DDPM inversion) and a manual editing pipeline. The model learns to take an audio clip and a text instruction, and then modify the audio accordingly. Although it was partially trained on synthetic data, it generalizes well to real-world audio and instructions it hasn't seen before.
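To make the triplet idea concrete, here is a minimal Python sketch of how such training data could be organized. The field and function names are illustrative assumptions, not the authors' actual schema or code:

```python
from dataclasses import dataclass

# Hypothetical representation of one (input audio, edit instruction,
# output audio) triplet from the paper's training dataset.
# Waveforms are shown as plain float lists for simplicity.
@dataclass
class EditTriplet:
    input_audio: list[float]   # samples of the original clip
    instruction: str           # free-form natural language edit
    output_audio: list[float]  # samples of the edited clip

def make_training_example(triplet: EditTriplet):
    """Pair the model's conditioning (audio + instruction) with its target."""
    return (triplet.input_audio, triplet.instruction), triplet.output_audio

# Toy example standing in for a real audio pair.
triplet = EditTriplet(
    input_audio=[0.0, 0.1, 0.2],
    instruction="make the birds sound louder",
    output_audio=[0.0, 0.2, 0.4],
)
(cond_audio, cond_text), target = make_training_example(triplet)
```

At training time, the model would see the conditioning pair and learn to produce the target audio; at inference time, only an input clip and an instruction are needed.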

Why it matters?

This work is important because it moves us closer to being able to edit audio as easily as we edit text. Imagine being able to say 'make the birds sound louder' or 'remove the background noise' and having a computer do it automatically. This could have huge implications for music production, podcasting, and accessibility.

Abstract

Generative models have made significant progress in synthesizing high-fidelity audio from short textual descriptions. However, editing existing audio using natural language has remained largely underexplored. Current approaches either require the complete description of the edited audio or are constrained to predefined edit instructions that lack flexibility. In this work, we introduce SAO-Instruct, a model based on Stable Audio Open capable of editing audio clips using any free-form natural language instruction. To train our model, we create a dataset of audio editing triplets (input audio, edit instruction, output audio) using Prompt-to-Prompt, DDPM inversion, and a manual editing pipeline. Although partially trained on synthetic data, our model generalizes well to real in-the-wild audio clips and unseen edit instructions. We demonstrate that SAO-Instruct achieves competitive performance on objective metrics and outperforms other audio editing approaches in a subjective listening study. To encourage future research, we release our code and model weights.