Step-Audio-R1 Technical Report

Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li, Daijiao Liu, Yayue Deng, Donghang Wu, Jun Chen, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu

2025-11-21

Summary

This paper introduces Step-Audio-R1, a new model that successfully adds reasoning abilities to audio-based artificial intelligence, something that has been surprisingly difficult to achieve until now.

What's the problem?

Existing reasoning models work very well with text and images by thinking through problems step by step. When applied to audio, however, these same models actually perform *worse* when asked to reason. This suggests that audio AI might not benefit from the kind of deliberate thinking that helps other modalities, and researchers weren't sure why, or how to fix it.

What's the solution?

The researchers developed Step-Audio-R1 and a technique called Modality-Grounded Reasoning Distillation (MGRD). Essentially, MGRD teaches the model to create reasoning steps that are directly connected to the actual sounds it's analyzing, instead of just making up unrelated thoughts. This grounding in the audio itself is key to making reasoning work for audio AI. The model was tested on speech, environmental sounds, and music.
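
To make the idea concrete, here is a minimal, hypothetical sketch of what an MGRD-style distillation loop might look like in Python. The teacher/student APIs (`generate_chain`, `fine_tune`), the keyword-based grounding check, and all names here are illustrative assumptions for intuition only, not the paper's actual method.

```python
# Hypothetical sketch of Modality-Grounded Reasoning Distillation (MGRD).
# All function names and the grounding check are illustrative assumptions;
# the real Step-Audio-R1 pipeline is described in the technical report.

ACOUSTIC_CUES = ("pitch", "tempo", "timbre", "speaker", "noise", "rhythm")

def is_audio_grounded(chain: str) -> bool:
    """Keep only reasoning chains that mention acoustic evidence."""
    text = chain.lower()
    return any(cue in text for cue in ACOUSTIC_CUES)

def distill(teacher, student, audio_dataset, chains_per_clip: int = 4):
    """Collect teacher reasoning chains, filter out chains that never
    reference the audio, then fine-tune the student on the survivors."""
    grounded = []
    for clip, question, answer in audio_dataset:
        for _ in range(chains_per_clip):
            chain = teacher.generate_chain(clip, question)  # assumed API
            if is_audio_grounded(chain):
                grounded.append((clip, question, chain, answer))
    student.fine_tune(grounded)  # assumed API: supervised fine-tuning
    return student
```

The key design choice this sketch tries to capture is the filtering step: reasoning chains that never touch the audio's acoustic properties are discarded, so the student only ever learns deliberation that is anchored in what it hears.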

Why does it matter?

This work shows that reasoning isn't limited to text and images; it can be applied to audio too, as long as the reasoning is firmly grounded in the audio's actual characteristics. Step-Audio-R1 surpasses Gemini 2.5 Pro and performs comparably to the state-of-the-art Gemini 3 Pro, and it paves the way for AI systems that can truly understand and reason about the world through all of our senses, not just sight and text.

Abstract

Recent advances in reasoning models have demonstrated remarkable success in text and vision domains through extended chain-of-thought deliberation. However, a perplexing phenomenon persists in audio language models: they consistently perform better with minimal or no reasoning, raising a fundamental question: can audio intelligence truly benefit from deliberate thinking? We introduce Step-Audio-R1, the first audio reasoning model that successfully unlocks reasoning capabilities in the audio domain. Through our proposed Modality-Grounded Reasoning Distillation (MGRD) framework, Step-Audio-R1 learns to generate audio-relevant reasoning chains that genuinely ground themselves in acoustic features rather than hallucinating disconnected deliberations. Our model exhibits strong audio reasoning capabilities, surpassing Gemini 2.5 Pro and achieving performance comparable to the state-of-the-art Gemini 3 Pro across comprehensive audio understanding and reasoning benchmarks spanning speech, environmental sounds, and music. These results demonstrate that reasoning is a transferable capability across modalities when appropriately anchored, transforming extended deliberation from a liability into a powerful asset for audio intelligence. By establishing the first successful audio reasoning model, Step-Audio-R1 opens new pathways toward building truly multimodal reasoning systems that think deeply across all sensory modalities.