MOSEL: 950,000 Hours of Speech Data for Open-Source Speech Foundation Model Training on EU Languages
Marco Gaido, Sara Papi, Luisa Bentivogli, Alessio Brutti, Mauro Cettolo, Roberto Gretter, Marco Matassoni, Mohamed Nabih, Matteo Negri
2024-10-03

Summary
This paper presents MOSEL, a collection of 950,000 hours of speech data available under open-source compliant licenses, intended for training open-source speech foundation models for the 24 official languages of the European Union.
What's the problem?
Existing speech foundation models are not fully open source, even when described that way: none of them makes its model weights, code, and training data all publicly available under open-source terms. This makes it difficult for researchers and developers to build and reproduce speech recognition systems for many languages, particularly those in the European Union, where some languages have very little training data.
What's the solution?
To address this issue, the researchers surveyed existing automatic speech recognition datasets and unlabeled speech corpora released under open-source compliant licenses, identifying a total of 950,000 hours of labeled (transcribed) and unlabeled audio. They also generated automatic transcripts for 441,000 hours of the unlabeled audio and released them under the permissive CC-BY license, so that anyone can use them to develop speech models. The project aims to improve the availability of training data for all EU languages, especially those that are less well represented.
Why does it matter?
This work provides a substantial resource for building speech recognition technology that covers multiple languages under open-source terms. Making the data inventory and transcripts available helps promote inclusivity and accessibility in AI applications across Europe, and it gives low-resource EU languages a better chance of being supported in future speech models.
Abstract
The rise of foundation models (FMs), coupled with regulatory efforts addressing their risks and impacts, has sparked significant interest in open-source models. However, existing speech FMs (SFMs) fall short of full compliance with the open-source principles, even if claimed otherwise, as no existing SFM has model weights, code, and training data publicly available under open-source terms. In this work, we take the first step toward filling this gap by focusing on the 24 official languages of the European Union (EU). We collect suitable training data by surveying automatic speech recognition datasets and unlabeled speech corpora under open-source compliant licenses, for a total of 950k hours. Additionally, we release automatic transcripts for 441k hours of unlabeled data under the permissive CC-BY license, thereby facilitating the creation of open-source SFMs for the EU languages.