Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation

Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, Zhizheng Wu

2025-01-28

Summary

This paper talks about Emilia, a huge multilingual dataset for training speech generation models. Instead of relying on audiobooks, the authors built an open-source pipeline called Emilia-Pipe that turns messy "in-the-wild" recordings of real, spontaneous speech into clean training data. Using it, they collected over 101,000 hours of speech in six languages, and later expanded the collection to more than 216,000 hours as Emilia-Large.

What's the problem?

Today's speech generation models are mostly trained on audiobook datasets, which contain formal, read-aloud speech. Because of that, the models struggle to reproduce the spontaneity and variability of how people actually talk in everyday life. There is a vast amount of in-the-wild audio that does capture real, spontaneous speech, but it is noisy and unstructured, so it has been hard to turn into high-quality training data.

What's the solution?

The researchers built Emilia-Pipe, an open-source preprocessing pipeline that extracts high-quality training data from raw in-the-wild recordings. Using it, they constructed Emilia, the first multilingual speech generation dataset derived from in-the-wild speech, with over 101,000 hours of audio across English, Chinese, German, French, Japanese, and Korean. They then expanded it into Emilia-Large, which exceeds 216,000 hours and is the largest open-source speech generation dataset available. In their experiments, models trained on Emilia generated noticeably more spontaneous and human-like speech than models trained on traditional audiobook data, and better captured the diverse speaker timbres and speaking styles of real-world speech.
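
The abstract does not spell out Emilia-Pipe's individual steps, so the following is a minimal, hypothetical sketch of how a preprocessing pipeline for in-the-wild speech is commonly structured: standardize the audio, separate speech from background noise or music, split it into single-speaker segments, transcribe it, and filter by quality. All function names, stages, and the quality threshold below are illustrative placeholders, not the actual Emilia-Pipe code.

```python
# Hypothetical sketch of an in-the-wild speech preprocessing pipeline.
# Names and stages are illustrative placeholders, not the real Emilia-Pipe API.
from dataclasses import dataclass
from pathlib import Path


@dataclass
class Segment:
    audio_path: Path      # path to one extracted speech segment
    speaker_id: str       # diarization label within the source recording
    transcript: str       # ASR transcript used as the training text
    quality_score: float  # estimated audio quality (e.g., a MOS-style score)


def standardize(raw: Path) -> Path:
    """Resample/convert the raw recording to a uniform format (placeholder)."""
    return raw


def separate_vocals(audio: Path) -> Path:
    """Remove background music/noise with a source-separation model (placeholder)."""
    return audio


def diarize_and_segment(audio: Path) -> list[tuple[str, Path]]:
    """Split the recording into single-speaker utterances (placeholder)."""
    return [("speaker_0", audio)]


def transcribe(audio: Path) -> str:
    """Run multilingual ASR to obtain a transcript (placeholder)."""
    return ""


def estimate_quality(audio: Path) -> float:
    """Score audio quality so low-quality clips can be filtered out (placeholder)."""
    return 3.5


def process_recording(raw: Path, min_quality: float = 3.0) -> list[Segment]:
    """End-to-end: raw in-the-wild recording -> filtered, transcribed segments."""
    audio = separate_vocals(standardize(raw))
    segments = []
    for speaker_id, clip in diarize_and_segment(audio):
        score = estimate_quality(clip)
        if score >= min_quality:  # keep only segments that pass the quality filter
            segments.append(Segment(clip, speaker_id, transcribe(clip), score))
    return segments
```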

Why it matters?

This research matters because the biggest bottleneck for natural-sounding speech generation is data. By showing that in-the-wild speech can be cleaned up at scale, and that larger datasets lead to better results, Emilia gives the community an open-source resource for building voices that sound spontaneous and human rather than like someone reading a book aloud. Because the dataset covers six languages, it also supports multilingual and crosslingual speech generation, which the authors validate in their experiments.

Abstract

Recent advancements in speech generation have been driven by the large-scale training datasets. However, current models fall short of capturing the spontaneity and variability inherent in real-world human speech, due to their reliance on audiobook datasets limited to formal read-aloud speech styles. To bridge this gap, we introduce Emilia-Pipe, an open-source preprocessing pipeline to extract high-quality training data from valuable yet underexplored in-the-wild data that capture spontaneous human speech in real-world contexts. By leveraging Emilia-Pipe, we construct Emilia, the first multilingual speech generation dataset derived from in-the-wild speech data. This dataset comprises over 101k hours of speech across six languages: English, Chinese, German, French, Japanese, and Korean. Besides, we expand Emilia to Emilia-Large, a dataset exceeding 216k hours, making it the largest open-source speech generation dataset available. Extensive experiments demonstrate that Emilia significantly outperforms traditional audiobook datasets in generating spontaneous and human-like speech, showcasing superior performance in capturing diverse speaker timbre and speaking styles of real-world human speech. Furthermore, this work underscores the importance of scaling dataset size to advance speech generation research and validates the effectiveness of Emilia for both multilingual and crosslingual speech generation.