VStyle: A Benchmark for Voice Style Adaptation with Spoken Instructions
Jun Zhan, Mingyang Han, Yuxuan Xie, Chen Wang, Dong Zhang, Kexin Huang, Haoxiang Shi, DongXiao Wang, Tengtao Song, Qinyuan Cheng, Shimin Li, Jun Song, Xipeng Qiu, Bo Zheng
2025-09-15
Summary
This paper explores how well current spoken language models can change the way they *sound* when given spoken instructions about *how* to speak, not just *what* to say. These models are getting good at understanding and responding to speech, but they are still poor at changing their speaking style on command.
What's the problem?
Existing speech models focus on getting the content of speech right and following instructions, but they largely ignore the ability to control *how* something is said. Imagine asking a model to sound happy, sad, or like a specific character: current models struggle with this kind of nuanced control over their vocal delivery. Until now, there was no good way to test this ability and no standard dataset to measure progress.
What's the solution?
The researchers created a new task called Voice Style Adaptation (VSA) and a benchmark dataset called VStyle, with spoken instructions in both Chinese and English. The dataset covers four categories of style control: adjusting acoustic attributes such as tone, following natural language style instructions, acting out a persona (role play), and responding with implicit empathy. They also built an evaluation framework, LALM-as-a-Judge, that uses another large audio language model to automatically and consistently score how well a model adapts its speaking style, checking in turn whether the spoken content is correct, whether the style matches the instruction, and whether the speech sounds natural. A rough sketch of how such a judge could be wired up is shown below.
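As an illustration only, the sketch below shows one way a progressive audio-model judge could be structured. It assumes a hypothetical query_audio_judge helper that sends an audio clip plus a scoring prompt to a large audio language model and parses a numeric score; the prompts, 0-5 scale, and the skip-on-failure rule are assumptions for the sketch, not the authors' actual implementation.

```python
# Minimal sketch of a progressive "LALM as a Judge" pipeline (illustrative).
from dataclasses import dataclass


def query_audio_judge(audio: bytes, prompt: str) -> float:
    """Placeholder: send the audio and a scoring prompt to a large audio
    language model and parse a 0-5 score from its reply."""
    raise NotImplementedError("wire this up to an audio LLM of your choice")


@dataclass
class JudgeResult:
    textual_faithfulness: float  # is the spoken content correct?
    style_adherence: float       # does the delivery match the instruction?
    naturalness: float           # does it sound like fluent human speech?


def judge_response(instruction: str, response_audio: bytes) -> JudgeResult:
    """Score one model response along three dimensions, in order.

    The evaluation is progressive: if the content itself fails, the later
    style and naturalness checks are skipped, since they are meaningless
    for an off-target answer (this gating rule is an assumption here).
    """
    faithfulness = query_audio_judge(
        response_audio,
        f"Instruction: {instruction}\n"
        "Rate 0-5 how faithfully the spoken content answers the instruction.",
    )
    if faithfulness == 0:
        return JudgeResult(0.0, 0.0, 0.0)

    style = query_audio_judge(
        response_audio,
        f"Instruction: {instruction}\n"
        "Rate 0-5 how well the speaking style (timbre, prosody, persona, "
        "emotion) follows the requested style.",
    )
    naturalness = query_audio_judge(
        response_audio,
        "Rate 0-5 how natural and human-like this speech sounds.",
    )
    return JudgeResult(faithfulness, style, naturalness)
```

The ordering mirrors the paper's three evaluation dimensions (textual faithfulness, style adherence, naturalness): style and naturalness only matter once the content itself is on target.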
Why does it matter?
This work is important because it highlights a gap in current speech technology. Being able to control the style of spoken output is crucial for creating more natural and engaging human-computer interactions. Think about virtual assistants, audiobooks, or voicing characters for video games: all of these would benefit from models that can truly adapt their voice to the situation. By releasing the dataset and evaluation tools, the researchers hope to encourage further development in this area.
Abstract
Spoken language models (SLMs) have emerged as a unified paradigm for speech understanding and generation, enabling natural human-machine interaction. However, while most progress has focused on semantic accuracy and instruction following, the ability of SLMs to adapt their speaking style based on spoken instructions has received limited attention. We introduce Voice Style Adaptation (VSA), a new task that examines whether SLMs can modify their speaking style, such as timbre, prosody, or persona, following natural-language spoken commands. To study this task, we present VStyle, a bilingual (Chinese & English) benchmark covering four categories of speech generation: acoustic attributes, natural language instruction, role play, and implicit empathy. We also introduce the Large Audio Language Model as a Judge (LALM as a Judge) framework, which progressively evaluates outputs along textual faithfulness, style adherence, and naturalness, ensuring reproducible and objective assessment. Experiments on commercial systems and open-source SLMs demonstrate that current models face clear limitations in controllable style adaptation, highlighting both the novelty and challenge of this task. By releasing VStyle and its evaluation toolkit, we aim to provide the community with a foundation for advancing human-centered spoken interaction. The dataset and code are publicly available at the project's homepage: https://junzhan2000.github.io/VStyle.github.io/.