Step-Audio-R1.5 Technical Report

Yuxin Zhang, Xiangyu Tony Zhang, Daijiao Liu, Fei Tian, Yayue Deng, Jun Chen, Qingjian Lin, Haoyang Zhang, Yuxin Li, Jinglan Gong, Yechang Huang, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Gang Yu, Xiangyu Zhang, Daxin Jiang

2026-04-29

Summary

This paper explores how AI models are trained to 'think' with audio, focusing on making them better at understanding and responding in spoken conversations. It argues that current training methods, while making models more accurate, often result in robotic and unnatural interactions.

What's the problem?

Currently, AI audio models are trained with a method that rewards them for getting the 'right' answer against specific, verifiable labels. This is like giving a student points only for correct answers on a multiple-choice test. While this improves accuracy on benchmarks, it pushes the models to focus on correctness *over* sounding natural and engaging in a realistic conversation. The models become good at answering questions, but bad at sustaining a flowing, emotionally intelligent dialogue; they fall into what the authors call the 'verifiable reward trap'.
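To make the critique concrete, here is a minimal sketch of a verifiable reward in the RLVR style the paper describes. The function name and label normalization are illustrative assumptions, not the paper's actual implementation; the point is that the training signal is a discrete match against a text label, blind to everything acoustic.

```python
def rlvr_reward(model_answer: str, gold_label: str) -> float:
    """Hypothetical verifiable reward: 1.0 if the model's final text answer
    matches the ground-truth label after normalization, 0.0 otherwise.
    Prosody, emotional tone, and conversational flow never enter the signal."""
    return float(model_answer.strip().lower() == gold_label.strip().lower())


# The reward only sees the discrete label, not how something is said:
print(rlvr_reward("Speaker B sounds anxious", "speaker b sounds anxious"))   # 1.0
print(rlvr_reward("B seems worried, almost trembling", "speaker b sounds anxious"))  # 0.0
```

A paraphrased, empathetic, or prosodically rich response scores zero under this rule, which is the mechanism behind the 'answering machine' behavior the paper criticizes.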

What's the solution?

The researchers introduce Step-Audio-R1.5, a model trained with Reinforcement Learning from Human Feedback (RLHF) rather than purely verifiable rewards. Instead of only rewarding correct answers, this approach uses feedback from real people to teach the AI what makes a conversation feel natural and immersive. It is like having a teacher give feedback on *how* a student explains an answer, not just whether the answer is right or wrong. This shifts the focus from simply being correct to being genuinely engaging and empathetic, as the sketch below illustrates.
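As a rough illustration of what human-feedback training involves (the report does not publish its training code, so the architecture, dimensions, and names below are assumptions), RLHF pipelines typically fit a reward model on pairs of responses where human raters preferred one over the other, using the standard Bradley-Terry pairwise objective:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class PreferenceRewardModel(nn.Module):
    """Hypothetical reward model: maps a fixed-size response embedding
    (e.g., an encoded audio reply) to a scalar naturalness/engagement score.
    The 768-dim embedding and the small MLP head are placeholder choices."""

    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.scorer = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, response_embedding: torch.Tensor) -> torch.Tensor:
        # One scalar score per response in the batch.
        return self.scorer(response_embedding).squeeze(-1)


def pairwise_preference_loss(model: nn.Module,
                             preferred: torch.Tensor,
                             rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry loss used in standard RLHF reward modeling: maximize
    the log-probability that the human-preferred response outscores the
    rejected one."""
    return -F.logsigmoid(model(preferred) - model(rejected)).mean()


# Toy usage: random embeddings stand in for encoded audio responses.
rm = PreferenceRewardModel()
preferred = torch.randn(8, 768)  # batch of human-preferred replies
rejected = torch.randn(8, 768)   # batch of rejected replies
loss = pairwise_preference_loss(rm, preferred, rejected)
loss.backward()
```

The learned score then replaces the binary label check as the reinforcement signal, so the policy is optimized toward what listeners actually prefer rather than toward exact-match correctness.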

Why it matters?

This work is important because it highlights the limitations of optimizing solely for accuracy when building AI that interacts with humans. It demonstrates that truly intelligent audio AI must go beyond understanding *what* is said to also grasp *how* it is said, and must respond in a way that feels natural and emotionally appropriate. This could lead to much more realistic and enjoyable interactions with AI assistants and conversational agents.

Abstract

Recent advancements in large audio language models have extended Chain-of-Thought (CoT) reasoning into the auditory domain, enabling models to tackle increasingly complex acoustic and spoken tasks. To elicit and sustain these extended reasoning chains, the prevailing paradigm -- driven by the success of text-based reasoning models -- overwhelmingly relies on Reinforcement Learning with Verified Rewards (RLVR). However, as models are strictly optimized to distill rich, continuous auditory contexts into isolated, verifiable text labels, a fundamental question arises: are we fostering true audio intelligence, or merely reducing a continuous sensory medium into a discrete puzzle? We identify this as the "verifiable reward trap." While RLVR yields remarkable scores on standardized objective benchmarks, it systematically degrades the real-world conversational feel of audio models. By prioritizing isolated correctness over acoustic nuance, RLVR reduces dynamic interactions to mechanical "answering machines," severely compromising prosodic naturalness, emotional continuity, and user immersion, particularly in long-turn dialogues. To bridge the gap between mechanical objective verification and genuine sensory empathy, we introduce Step-Audio-R1.5, marking a paradigm shift toward Reinforcement Learning from Human Feedback (RLHF) in audio reasoning. Comprehensive evaluations demonstrate that Step-Audio-R1.5 not only maintains robust analytical reasoning but profoundly transforms the interactive experience, redefining the boundaries of deeply immersive long-turn spoken dialogue.