Marco-Voice Technical Report
Fengping Tian, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, Kaifu Zhang
2025-08-08
Summary
This paper presents Marco-Voice, a speech synthesis system that produces natural-sounding speech by cloning different voices and controlling the emotion of the speech independently of the voice.
What's the problem?
Many speech synthesis systems struggle to produce speech that both sounds like a specific person and clearly expresses emotion, which makes their output less realistic and expressive.
What's the solution?
The authors designed a system that separates voice characteristics from emotion using two techniques: speaker-emotion disentanglement and rotational emotional embeddings. This lets the system control voice and emotion independently and produce more lifelike speech.
Why does it matter?
This matters because having speech that sounds natural and emotionally expressive can improve communication in virtual assistants, games, and other AI applications, making interactions more engaging and human-like.
Abstract
A multifunctional speech synthesis system integrates voice cloning and emotion control using speaker-emotion disentanglement and rotational emotional embeddings, achieving highly expressive and natural speech.
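As a rough illustration of the two ideas named above, the toy sketch below treats speaker identity and emotion as vectors. This page does not describe how Marco-Voice computes either one, so everything here is an assumption made for illustration: the 4-dimensional vectors, the helper names `remove_component` and `rotate_plane`, and the reading of "disentanglement" as an orthogonal projection and of "rotational" as a plane rotation. It is not the authors' implementation.

```python
import math

def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def remove_component(vec, direction):
    """Project `vec` onto the subspace orthogonal to `direction`."""
    scale = dot(vec, direction) / dot(direction, direction)
    return [v - scale * d for v, d in zip(vec, direction)]

def rotate_plane(vec, i, j, angle):
    """Rotate `vec` by `angle` radians in the plane of coordinates i and j."""
    out = list(vec)
    c, s = math.cos(angle), math.sin(angle)
    out[i] = c * vec[i] - s * vec[j]
    out[j] = s * vec[i] + c * vec[j]
    return out

# Hypothetical toy embeddings (assumed values, not taken from the paper).
speaker = [1.0, 0.0, 0.0, 0.0]
raw_emotion = [0.5, 1.0, 0.0, 0.0]  # leaks some speaker information

# Disentangle: strip the speaker direction out of the emotion vector.
emotion = remove_component(raw_emotion, speaker)

# Vary the emotion by rotating it in a plane that excludes the speaker axis;
# the speaker vector itself is never modified.
shifted = rotate_plane(emotion, 1, 2, math.pi / 2)
```

In this sketch the rotation changes the direction of the emotion vector without changing its length, and because it acts only on coordinates the speaker vector does not use, the emotion can be varied while the speaker representation stays fixed.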