Marco-Voice Technical Report

Fengping Tian, Chenyang Lyu, Xuanfan Ni, Haoqin Sun, Qingjuan Li, Zhiqiang Qian, Haijun Li, Longyue Wang, Zhao Xu, Weihua Luo, Kaifu Zhang

2025-08-08

Summary

This paper introduces Marco-Voice, a speech synthesis system that produces natural-sounding speech by cloning a target speaker's voice and controlling the emotion of the speech independently of that voice.

What's the problem?

Many speech synthesis systems struggle to produce speech that both sounds like a specific person and clearly expresses emotion, which makes their output less realistic and expressive.

What's the solution?

The authors designed a system that separates voice characteristics from emotion using two techniques, speaker-emotion disentanglement and rotational emotional embeddings, so that voice and emotion can be controlled independently and the resulting speech sounds more lifelike.
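
To make the idea of independent control more concrete, here is a minimal Python sketch. It is not the authors' implementation: this summary does not say how the rotational emotional embeddings are computed, so the sketch assumes one plausible reading, in which an emotion is represented as the average shift between length-normalized neutral and emotional embeddings of the same speaker. The function names (`unit`, `emotion_direction`, `condition`), the toy two-dimensional vectors, and the `intensity` parameter are all hypothetical.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in v]

def emotion_direction(pairs):
    """Average the shift from neutral to emotional unit vectors.

    pairs: list of (neutral_embedding, emotional_embedding) tuples, each pair
    assumed to come from the same speaker so that the speaker's identity
    largely cancels out of the difference. (Assumption, not from the paper.)
    """
    dim = len(pairs[0][0])
    total = [0.0] * dim
    for neutral, emotional in pairs:
        n, e = unit(neutral), unit(emotional)
        for i in range(dim):
            total[i] += e[i] - n[i]
    return [t / len(pairs) for t in total]

def condition(speaker_embedding, direction, intensity=1.0):
    """Combine a speaker vector with a scaled emotion direction.

    The speaker vector and the emotion direction stay separate inputs, so
    either one can be swapped without touching the other.
    """
    return [s + intensity * d for s, d in zip(speaker_embedding, direction)]

# Toy data: two (neutral, emotional) pairs with made-up 2-D embeddings.
pairs = [([1.0, 0.0], [0.0, 1.0]), ([2.0, 0.0], [0.0, 3.0])]
direction = emotion_direction(pairs)
conditioned = condition([0.5, 0.5], direction, intensity=0.5)
```

In this toy setup, changing `intensity` strengthens or weakens the emotion without altering the speaker vector, and the same `direction` can be applied to a different speaker's embedding. A real system would feed such a conditioning vector into a trained synthesis model, which this sketch does not include.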

Why does it matter?

This matters because having speech that sounds natural and emotionally expressive can improve communication in virtual assistants, games, and other AI applications, making interactions more engaging and human-like.

Abstract

A multifunctional speech synthesis system integrates voice cloning and emotion control using speaker-emotion disentanglement and rotational emotional embeddings, achieving highly expressive and natural speech.
