LEMAS: A 150K-Hour Large-scale Extensible Multilingual Audio Suite with Generative Speech Models
Zhiyuan Zhao, Lijian Lin, Ye Zhu, Kai Xie, Yunfei Liu, Yu Li
2026-01-09
Summary
This paper introduces LEMAS-Dataset, a 150,000-hour collection of speech data spanning ten languages, and shows how it can be used to build better speech generation and editing models.
What's the problem?
Creating AI that can convincingly generate and edit speech in multiple languages is difficult because it requires huge amounts of high-quality, accurately labeled data. Existing datasets either aren't big enough, don't cover enough languages, or lack precise timing information for each word spoken, making it hard to train these models effectively.
What's the solution?
The researchers built LEMAS-Dataset, which contains over 150,000 hours of speech across ten languages, and importantly, includes timestamps for *every* word. They then used this dataset to train two different AI models: one for creating speech from text (LEMAS-TTS) and another for editing existing speech (LEMAS-Edit). They used clever techniques like accent-adversarial training and special masking strategies to improve the quality and smoothness of the generated and edited speech.
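The paper does not publish the masking code, but the idea of using word-level timestamps to build an editing mask can be sketched as follows. This is an illustrative assumption, not the authors' implementation: the `Word` record, `mask_span_for_words` helper, and the frame rate of 50 are all hypothetical, standing in for whatever tokenizer frame rate LEMAS-Edit actually uses.

```python
# Illustrative sketch (not the authors' code): turning word-level timestamps
# into a token-span mask for infilling-style speech editing. The model would
# be trained to regenerate tokens inside this span given the surrounding audio.
from dataclasses import dataclass

@dataclass
class Word:
    text: str
    start: float  # seconds
    end: float    # seconds

def mask_span_for_words(words, edit_indices, frame_rate=50):
    """Return (start_frame, end_frame) covering the words to be re-generated.

    `edit_indices` index into `words` and mark the region being edited;
    `frame_rate` (frames/sec) converts timestamps to token positions.
    """
    start = min(words[i].start for i in edit_indices)
    end = max(words[i].end for i in edit_indices)
    return int(start * frame_rate), int(end * frame_rate)

words = [Word("hello", 0.00, 0.42), Word("world", 0.45, 0.90), Word("today", 0.95, 1.40)]
print(mask_span_for_words(words, [1]))  # span of frames covering "world"
```

The precision of the dataset's word alignments is what makes this kind of mask construction possible; without per-word timing, the mask boundaries would have to be estimated, producing audible seams at the edit points.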
Why it matters?
This new dataset is a big step forward because it provides researchers with the resources they need to develop more advanced speech generation and editing systems that work well in many languages. It could lead to improvements in things like voice assistants, translation tools, and even creating realistic voices for characters in video games or movies.
Abstract
We present the LEMAS-Dataset, which, to our knowledge, is currently the largest open-source multilingual speech corpus with word-level timestamps. Covering over 150,000 hours across 10 major languages, LEMAS-Dataset is constructed via an efficient data processing pipeline that ensures high-quality data and annotations. To validate the effectiveness of LEMAS-Dataset across diverse generative paradigms, we train two benchmark models with distinct architectures and task specializations on this dataset. LEMAS-TTS, built upon a non-autoregressive flow-matching framework, leverages the dataset's massive scale and linguistic diversity to achieve robust zero-shot multilingual synthesis. Our proposed accent-adversarial training and CTC loss mitigate cross-lingual accent issues, enhancing synthesis stability. Complementarily, LEMAS-Edit employs an autoregressive decoder-only architecture that formulates speech editing as a masked token infilling task. By exploiting precise word-level alignments to construct training masks and adopting adaptive decoding strategies, it achieves seamless speech editing with smooth boundaries and natural transitions. Experimental results demonstrate that models trained on LEMAS-Dataset deliver high-quality synthesis and editing performance, confirming the dataset's quality. We envision that this richly timestamp-annotated, fine-grained multilingual corpus will drive future advances in prompt-based speech generation systems.
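The abstract names accent-adversarial training but does not specify how it is implemented. A common way to realize such adversarial objectives (assumed here, in the style of gradient-reversal domain adaptation, not taken from the paper) is to attach an accent classifier behind a gradient reversal layer, so the speech encoder is pushed toward accent-invariant representations:

```python
# Hedged sketch: a gradient reversal layer, one standard building block for
# adversarial training. The backward pass negates (and scales) the gradient,
# so minimizing an accent classifier's loss *maximizes* it w.r.t. the encoder.
import torch

class GradReverse(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x, lam):
        ctx.lam = lam
        return x.clone()  # identity in the forward pass

    @staticmethod
    def backward(ctx, grad_output):
        # Flip the sign of the gradient flowing back into the encoder.
        return -ctx.lam * grad_output, None

def grad_reverse(x, lam=1.0):
    return GradReverse.apply(x, lam)

# Demo: the gradient through the reversal layer comes out negated.
x = torch.ones(3, requires_grad=True)
y = grad_reverse(x, lam=1.0).sum()
y.backward()
print(x.grad)  # all entries are -1
```

Whether LEMAS-TTS uses gradient reversal specifically, or an alternating min-max scheme, is not stated in this summary; the sketch only illustrates the general mechanism the term "accent-adversarial" usually refers to.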