EmoKnob: Enhance Voice Cloning with Fine-Grained Emotion Control

Haozhe Chen, Run Chen, Julia Hirschberg

2024-10-03

Summary

This paper introduces EmoKnob, a new system that enhances voice cloning technology by allowing users to control the emotions expressed in synthesized speech.

What's the problem?

While recent advances in Text-to-Speech (TTS) technology have made synthesized voices sound more natural and expressive, they still do not let users choose a specific emotion or set how intense that emotion should be. Without this control, synthesized speech can come across as emotionally flat, which limits how personalized and engaging voice experiences can be.

What's the solution?

EmoKnob addresses this issue with a framework that enables fine-grained emotion control in speech synthesis. Given just a few demonstrative samples of an emotion, the system learns how to adjust the emotional tone of a cloned voice, including how strongly the emotion is expressed. Building on this few-shot capability, the authors developed two methods for applying emotion control from open-ended text descriptions, so users can specify a wide range of nuanced emotions in plain language. They also introduced new evaluation metrics to measure how faithfully and recognizably the system conveys these emotions. In their experiments, EmoKnob produced more expressive speech than existing commercial TTS services.
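The abstract says the framework "leverages the expressive speaker representation space" of foundation voice cloning models, but does not spell out the mechanism. The sketch below shows one plausible reading of a few-shot emotion "knob": estimate an emotion direction in speaker-embedding space from paired emotional/neutral samples, then nudge a cloned speaker's embedding along it. The function names (`embed_speaker`, `emotion_direction`, `apply_emotion`) and the mean-difference formulation are illustrative assumptions, not the authors' actual API.

```python
import numpy as np

def emotion_direction(emotional_clips, neutral_clips, embed_speaker):
    """Estimate an emotion direction in speaker-embedding space from a few
    paired demonstrative samples (the few-shot setting). `embed_speaker`
    is a placeholder for a voice-cloning model's speaker encoder."""
    diffs = [
        embed_speaker(emo) - embed_speaker(neu)
        for emo, neu in zip(emotional_clips, neutral_clips)
    ]
    direction = np.mean(diffs, axis=0)
    return direction / np.linalg.norm(direction)  # unit-length direction

def apply_emotion(speaker_embedding, direction, strength):
    """Shift a cloned speaker's embedding along the emotion direction.
    `strength` plays the role of the knob: 0.0 leaves the voice neutral,
    and larger values (hypothetically) intensify the emotion."""
    return speaker_embedding + strength * direction
```

Under this reading, a single scalar controls emotion intensity at synthesis time, which matches the paper's claim of fine-grained control without retraining the underlying TTS model.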

Why it matters?

This research is important because it opens up new possibilities for creating more relatable and human-like AI voices. By allowing users to control the emotional tone of synthesized speech, EmoKnob can improve applications like virtual assistants, audiobooks, and video games, making them more immersive and enjoyable for users.

Abstract

While recent advances in Text-to-Speech (TTS) technology produce natural and expressive speech, they lack the option for users to select emotion and control intensity. We propose EmoKnob, a framework that allows fine-grained emotion control in speech synthesis with few-shot demonstrative samples of arbitrary emotion. Our framework leverages the expressive speaker representation space made possible by recent advances in foundation voice cloning models. Based on the few-shot capability of our emotion control framework, we propose two methods to apply emotion control on emotions described by open-ended text, enabling an intuitive interface for controlling a diverse array of nuanced emotions. To facilitate a more systematic emotional speech synthesis field, we introduce a set of evaluation metrics designed to rigorously assess the faithfulness and recognizability of emotion control frameworks. Through objective and subjective evaluations, we show that our emotion control framework effectively embeds emotions into speech and surpasses emotion expressiveness of commercial TTS services.
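The abstract names two properties its metrics should capture, faithfulness and recognizability, without defining them here. As a hedged illustration only, one natural way to operationalize recognizability is the fraction of synthesized utterances that an independent speech-emotion classifier labels with the intended emotion; the callables `tts` and `classifier` below are placeholders, not the paper's definitions.

```python
def recognizability(tts, classifier, texts, target_emotion, strength=1.0):
    """Share of synthesized utterances an off-the-shelf speech-emotion
    classifier assigns the intended label. `tts(text, emotion, strength)`
    and `classifier(audio)` are assumed interfaces for illustration."""
    hits = 0
    for text in texts:
        audio = tts(text, emotion=target_emotion, strength=strength)
        hits += classifier(audio) == target_emotion
    return hits / len(texts)
```

A faithfulness-style metric would instead ask whether the controlled voice still sounds like the original speaker, e.g. via speaker-similarity scores, but the paper's exact formulations should be taken from the paper itself.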