Step-Audio 2 Technical Report

Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, Mingrui Chen, Peng Liu, Wang You, Xiangyu Tony Zhang, Xingyuan Li, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang, Zhao You, Brian Li

2025-07-23

Summary

This paper introduces Step-Audio 2, an advanced multi-modal large language model designed to understand and generate audio and speech by combining latent audio encoding with reinforcement learning.

What's the problem?

Previous audio language models struggled to capture not only the words people say but also paralinguistic details such as emotions, speaking styles, and other voice characteristics, which made conversations feel less natural and expressive.

What's the solution?

The authors built Step-Audio 2 to generate discrete audio tokens alongside text, enabling it to capture subtle voice features and emotions. It uses reinforcement learning to improve its reasoning abilities and integrates external tools like web and audio search to provide more reliable and expressive responses. The model was trained on millions of hours of speech and text data.
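To illustrate the idea of generating discrete audio tokens alongside text, here is a minimal sketch, not the actual Step-Audio 2 implementation: it assumes a single output stream in which token IDs below a text-vocabulary offset are text tokens and IDs at or above it index into an audio codec codebook. All sizes and IDs are hypothetical.

```python
# Hypothetical sketch: separating an interleaved text/audio token stream.
# Vocabulary sizes and token IDs are illustrative, not from the paper.

TEXT_VOCAB_SIZE = 50_000      # assumed size of the text vocabulary
AUDIO_CODEBOOK_SIZE = 1_024   # assumed size of the audio codec codebook

def split_interleaved_stream(tokens):
    """Split a mixed output stream into text IDs and audio codebook indices.

    IDs below TEXT_VOCAB_SIZE are treated as text tokens; IDs at or above
    the offset are mapped back to indices in the audio codebook.
    """
    text_ids, audio_ids = [], []
    for t in tokens:
        if t < TEXT_VOCAB_SIZE:
            text_ids.append(t)
        else:
            audio_ids.append(t - TEXT_VOCAB_SIZE)  # undo the offset
    return text_ids, audio_ids

# Example: a short made-up stream mixing text and audio tokens
stream = [12, 345, 50_007, 50_008, 678, 50_009]
text_ids, audio_ids = split_interleaved_stream(stream)
print(text_ids)   # text token IDs
print(audio_ids)  # audio codebook indices
```

In a real system of this kind, the audio indices would be fed to a neural codec decoder to synthesize the waveform, while the text tokens carry the transcript; emitting both from one stream is what lets the model couple wording with prosody and emotion.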

Why it matters?

This matters because it helps build voice assistants and AI conversational agents that sound more natural, understand emotions better, and can have more engaging and accurate speech interactions across many languages and scenarios.

Abstract

Step-Audio 2, an end-to-end multi-modal large language model, integrates latent audio encoding and reinforcement learning to achieve state-of-the-art performance in ASR, audio understanding, and speech conversation, incorporating discrete audio token generation and retrieval-augmented generation.