WavAlign: Enhancing Intelligence and Expressiveness in Spoken Dialogue Models via Adaptive Hybrid Post-Training

Yifu Chen, Shengpeng Ji, Qian Chen, Tianle Liang, Yangzhuo Li, Ziqing Wang, Wen Wang, Jingyu Lu, Haoxiao Wang, Xueyi Pu, Fan Zhuo, Zhou Zhao

2026-04-23

Summary

This paper explores how to make computer programs that *talk* back to people – specifically, spoken dialogue systems – sound more natural and intelligent, using a technique called reinforcement learning.

What's the problem?

Current spoken dialogue systems, while improving, often sound robotic or don't quite understand what you mean. A promising approach is to use reinforcement learning, where the system learns by getting feedback on its responses, but directly applying this to spoken dialogue is tricky. The problem is that feedback (preferences) is often given on the *meaning* of what's said, but the system generates the actual *sound* of speech, and these two things are linked in a complex way during the learning process. The feedback signal can be weak or misleading when updating both meaning and sound at the same time.

What's the solution?

The researchers developed a new method called 'modality-aware adaptive post-training', which splits the learning by modality. Updates based on preference feedback are confined to the *meaning* (semantic) side of the response, while the *sound* quality is improved separately by anchoring it to good reference speech. The method also dynamically adjusts how much weight the feedback gets versus the anchor, so the system doesn't learn from unreliable preference signals.
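The mixing idea can be sketched as a toy loss combiner. Everything here is an illustrative assumption, not the paper's actual formulation: the function name, the use of the reward gap across rollouts as a reliability proxy, and the gating threshold are all made up to show the general shape of "preference loss on the semantic channel, anchoring loss on the acoustic channel, adaptively mixed".

```python
def hybrid_post_training_loss(pref_loss_semantic, anchor_loss_acoustic,
                              rollout_rewards, min_gap=0.1):
    """Mix a preference loss (semantic tokens) with an anchoring loss
    (acoustic tokens), down-weighting the preference gradient when the
    sampled rollouts' rewards barely separate, i.e. the signal is weak.

    Hypothetical sketch: the reliability heuristic is an assumption.
    """
    # Reliability proxy: how clearly the rollouts are ranked by reward.
    gap = min(max(rollout_rewards) - min(rollout_rewards), 1.0)
    # Gate out near-ties entirely; otherwise scale with the gap.
    alpha = 0.0 if gap < min_gap else gap
    return alpha * pref_loss_semantic + (1.0 - alpha) * anchor_loss_acoustic
```

With clearly separated rollout rewards the preference term dominates; with near-identical rewards the update falls back entirely on the acoustic anchor, which matches the paper's stated goal of avoiding unreliable preference gradients.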

Why it matters?

This work is important because it makes reinforcement learning a viable option for improving spoken dialogue systems. By addressing the challenges of combining feedback on meaning with the generation of speech, the researchers were able to create systems that are both more understandable and sound more natural, bringing us closer to truly conversational AI.

Abstract

End-to-end spoken dialogue models have garnered significant attention because they offer a higher potential ceiling in expressiveness and perceptual ability than cascaded systems. However, the intelligence and expressiveness of current open-source spoken dialogue models often remain below expectations. Motivated by the success of online reinforcement learning (RL) in other domains, one might attempt to directly apply preference optimization to spoken dialogue models, yet this transfer is non-trivial. We analyze these obstacles from the perspectives of reward modeling and rollout sampling, focusing on how sparse preference supervision interacts with dense speech generation under shared-parameter updates. Based on the analysis, we propose a modality-aware adaptive post-training recipe that makes RL practical for spoken dialogue: it constrains preference updates to the semantic channel and improves acoustic behavior via explicit anchoring, while dynamically regulating their mixture from rollout statistics to avoid unreliable preference gradients. We evaluate the method across multiple spoken dialogue benchmarks and representative architectures, and observe consistent improvements in semantic quality and speech expressiveness.