Zero-shot Cross-lingual Voice Transfer for TTS

Fadi Biadsy, Youzheng Chen, Isaac Elias, Kyle Kastner, Gary Wang, Andrew Rosenberg, Bhuvana Ramabhadran

2024-09-24

Summary

This paper introduces a zero-shot Voice Transfer (VT) module for multilingual text-to-speech that carries a person's voice into other languages from a single short audio sample. The same approach can also help restore the voices of people with speech impairments.

What's the problem?

Many people lose their ability to speak clearly due to medical conditions, making it hard for them to communicate. Traditional voice transfer methods require many recordings of a person's voice, which isn't always possible, especially for those who haven't saved their voice before losing it. Additionally, existing systems often struggle to transfer voices accurately when switching languages.

What's the solution?

To solve these issues, the researchers developed a VT module that needs only a single short audio sample of the person's voice. The module combines a speaker encoder, a bottleneck layer, and residual adapters attached to an existing multilingual TTS system, so the system can produce speech in other languages while preserving the speaker's vocal characteristics. Using one English reference sample per speaker, it reached an average voice transfer similarity score of 73% across nine target languages. A case study shows it can also work from atypical speech samples of speakers with dysarthria.
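This summary does not say how the similarity score is measured. As a rough illustration only, one common way to score speaker similarity is cosine similarity between speaker embeddings of the reference and the synthesized audio, averaged over utterances. The sketch below assumes that setup; the function names and the embedding vectors are hypothetical, and the paper's actual metric may differ (for example, it may rely on human ratings or a specific verification model).

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    if len(a) != len(b):
        raise ValueError("embeddings must have the same length")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("zero-norm embedding")
    return dot / (norm_a * norm_b)


def average_similarity(reference_emb, synthesized_embs):
    """Mean cosine similarity between one reference and many synthesized embeddings."""
    scores = [cosine_similarity(reference_emb, e) for e in synthesized_embs]
    return sum(scores) / len(scores)
```

In this sketch, `reference_emb` would come from the speaker's single reference recording and `synthesized_embs` from TTS output in each target language, all produced by the same (assumed) speaker-embedding model.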

Why does it matter?

This research is crucial because it offers hope for individuals who have lost their voices due to health issues. By enabling voice restoration and cross-lingual capabilities, this technology can improve communication for many people, allowing them to express themselves more easily and regain a sense of identity. Moreover, it opens up new possibilities for text-to-speech applications in various languages, making technology more accessible.

Abstract

In this paper, we introduce a zero-shot Voice Transfer (VT) module that can be seamlessly integrated into a multi-lingual Text-to-speech (TTS) system to transfer an individual's voice across languages. Our proposed VT module comprises a speaker-encoder that processes reference speech, a bottleneck layer, and residual adapters, connected to preexisting TTS layers. We compare the performance of various configurations of these components and report Mean Opinion Score (MOS) and Speaker Similarity across languages. Using a single English reference speech per speaker, we achieve an average voice transfer similarity score of 73% across nine target languages. Vocal characteristics contribute significantly to the construction and perception of individual identity. The loss of one's voice, due to physical or neurological conditions, can lead to a profound sense of loss, impacting one's core identity. As a case study, we demonstrate that our approach can not only transfer typical speech but also restore the voices of individuals with dysarthria, even when only atypical speech samples are available - a valuable utility for those who have never had typical speech or banked their voice. Cross-lingual typical audio samples, plus videos demonstrating voice restoration for dysarthric speakers, are available here (google.github.io/tacotron/publications/zero_shot_voice_transfer).
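To make the three named components concrete, here is a minimal pure-Python sketch of how a speaker encoder, a bottleneck layer, and a residual adapter might fit together. It is not the authors' implementation: the mean pooling, the tanh and ReLU nonlinearities, the concatenation used for conditioning, all dimensions, and every function name are assumptions chosen for illustration.

```python
import math
import random


def linear(x, w):
    """Multiply vector x (length n) by weight matrix w (n rows, m columns)."""
    return [sum(xi * row[j] for xi, row in zip(x, w)) for j in range(len(w[0]))]


def init(n_in, n_out, rng, scale=0.1):
    """Small random weight matrix with n_in rows and n_out columns."""
    return [[rng.uniform(-scale, scale) for _ in range(n_out)] for _ in range(n_in)]


def speaker_encoder(frames, w_enc):
    """Mean-pool reference frames over time, then project to a speaker embedding."""
    dim = len(frames[0])
    pooled = [sum(f[i] for f in frames) / len(frames) for i in range(dim)]
    return linear(pooled, w_enc)


def bottleneck(embedding, w_bn):
    """Squeeze the embedding into a low-dimensional, bounded speaker vector."""
    return [math.tanh(v) for v in linear(embedding, w_bn)]


def residual_adapter(hidden, speaker_vec, w_down, w_up):
    """Add a speaker-conditioned correction to one hidden state of a TTS layer."""
    # Concatenate hidden state and speaker vector, down-project, apply ReLU.
    z = [max(0.0, v) for v in linear(hidden + speaker_vec, w_down)]
    delta = linear(z, w_up)
    return [h + d for h, d in zip(hidden, delta)]


# Hypothetical sizes, chosen only so the example runs quickly.
FRAME_DIM, EMB_DIM, BN_DIM, HID_DIM, ADAPTER_DIM = 8, 16, 4, 12, 6
rng = random.Random(0)

w_enc = init(FRAME_DIM, EMB_DIM, rng)
w_bn = init(EMB_DIM, BN_DIM, rng)
w_down = init(HID_DIM + BN_DIM, ADAPTER_DIM, rng)
# Zero up-projection: the adapter starts as an identity mapping.
w_up = [[0.0] * HID_DIM for _ in range(ADAPTER_DIM)]

reference = [[rng.gauss(0.0, 1.0) for _ in range(FRAME_DIM)] for _ in range(50)]
hidden = [rng.gauss(0.0, 1.0) for _ in range(HID_DIM)]

spk = bottleneck(speaker_encoder(reference, w_enc), w_bn)
out = residual_adapter(hidden, spk, w_down, w_up)
```

Initializing the up-projection to zero is a common adapter practice rather than something the abstract states: the adapter then leaves the pretrained TTS layer's output unchanged until training moves the weights, which is one plausible reason adapters can be attached to preexisting layers without disturbing them.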
