Step-Audio-EditX Technical Report

Chao Yan, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu, Yuxin Zhang, Xiangyu Zhang, Fei Tian, Xuerui Yang, Daxin Jiang, Gang Yu

2025-11-05

Summary

This paper introduces Step-Audio-EditX, a new AI model that can edit audio in fine-grained ways, like changing the emotion in someone's voice or how they speak, and can also create speech from text.

What's the problem?

Existing audio editing programs often struggle to make subtle changes to audio, especially when it comes to things like emotion or speaking style. They also usually need a lot of real-world audio examples to learn from, which can be hard to get and might not cover all the voices and styles you want. The goal was to create a program that could edit audio expressively and easily, without needing tons of pre-recorded examples.

What's the solution?

The researchers created Step-Audio-EditX, which uses a type of artificial intelligence called a large language model, similar to what powers chatbots. The key is that they trained it using *artificial* audio data they created themselves, rather than relying on recordings of real people. This artificial data was designed to be very clear and distinct, helping the program learn to control the audio effectively. This approach allows for precise control over the audio and lets it work with many different voices.
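To make the "large-margin" idea concrete, here is a minimal illustrative sketch in Python. Everything in it (the class names, the scoring values, and the `margin` threshold) is hypothetical and not taken from the paper; it only shows the general principle of keeping synthetic training pairs whose target attribute differs by a wide, unambiguous margin, so the model learns from clearly distinct examples.

```python
from dataclasses import dataclass

# Hypothetical sketch of large-margin data selection. The paper's actual
# pipeline and scoring model are not described here; this only illustrates
# the principle of keeping clearly separated synthetic pairs.

@dataclass
class SyntheticPair:
    text: str
    score_a: float  # attribute score of rendition A (e.g. "happiness" in [0, 1])
    score_b: float  # attribute score of rendition B of the same text

def select_large_margin_pairs(pairs, margin=0.5):
    """Keep only pairs whose attribute scores differ by at least `margin`."""
    return [p for p in pairs if abs(p.score_a - p.score_b) >= margin]

pairs = [
    SyntheticPair("Hello there.", 0.9, 0.2),    # clearly distinct -> kept
    SyntheticPair("Hello there.", 0.55, 0.45),  # too similar -> dropped
]
kept = select_large_margin_pairs(pairs)
print(len(kept))  # -> 1
```

The filtering step is the point: by discarding ambiguous pairs, the remaining data gives the model an unmistakable signal about what the edited attribute should sound like.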

Why it matters?

This is important because it opens up possibilities for more accessible and flexible audio editing. Imagine easily changing the tone of a voice in a video game, creating personalized audiobooks with different emotional deliveries, or even helping people with speech impairments. Because the program is 'open-source,' meaning the code is freely available, other researchers and developers can build upon this work and create even more advanced audio tools.

Abstract

We present Step-Audio-EditX, the first open-source LLM-based audio model excelling at expressive and iterative audio editing encompassing emotion, speaking style, and paralinguistics, alongside robust zero-shot text-to-speech (TTS) capabilities. Our core innovation lies in leveraging only large-margin synthetic data, which circumvents the need for embedding-based priors or auxiliary modules. This large-margin learning approach enables both iterative control and high expressivity across voices, and represents a fundamental pivot from the conventional focus on representation-level disentanglement. Evaluation results demonstrate that Step-Audio-EditX surpasses both MiniMax-2.6-hd and Doubao-Seed-TTS-2.0 in emotion editing and other fine-grained control tasks.