Speech-MASSIVE: A Multilingual Speech Dataset for SLU and Beyond

Beomseok Lee, Ioan Calapodescu, Marco Gaido, Matteo Negri, Laurent Besacier

2024-08-08

Summary

This paper introduces Speech-MASSIVE, a large dataset designed to improve spoken language understanding (SLU) by providing recorded and transcribed speech in 12 languages.

What's the problem?

There is a lack of extensive multilingual datasets for training AI systems to understand spoken language. Most existing datasets focus on a limited number of languages or do not provide enough examples for effective training. This scarcity makes it difficult for AI models to accurately understand and process speech in different languages, especially less commonly spoken ones.

What's the solution?

The authors created Speech-MASSIVE, which pairs speech recordings in 12 languages with transcriptions and the existing MASSIVE annotations for intent prediction and slot-filling. This makes the dataset useful not just as a collection of audio, but as fully labeled training material for AI models. The authors also report baseline results from both cascaded and end-to-end systems under zero-shot, few-shot, and full fine-tuning setups, so researchers can benchmark their own models against established numbers.
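To make the two annotation tasks concrete, here is a minimal sketch of what an intent-plus-slots example looks like and how per-token BIO slot tags are commonly collapsed into labeled spans. The field names, utterance, and tag set below are illustrative assumptions, not the dataset's exact schema.

```python
from dataclasses import dataclass

@dataclass
class SLUExample:
    """One annotated utterance (fields are illustrative, not the exact schema)."""
    locale: str
    utterance: str
    intent: str
    slot_tags: list  # one BIO tag per whitespace token

def bio_to_slots(tokens, tags):
    """Collapse per-token BIO tags into (slot_name, slot_text) spans."""
    slots, current = [], None
    for tok, tag in zip(tokens, tags):
        if tag.startswith("B-"):          # a new slot span begins
            if current:
                slots.append((current[0], " ".join(current[1])))
            current = (tag[2:], [tok])
        elif tag.startswith("I-") and current and tag[2:] == current[0]:
            current[1].append(tok)        # continue the open span
        else:                             # "O" tag or inconsistent "I-": close any open span
            if current:
                slots.append((current[0], " ".join(current[1])))
            current = None
    if current:
        slots.append((current[0], " ".join(current[1])))
    return slots

# A made-up French example in the style of MASSIVE's alarm domain.
ex = SLUExample(
    locale="fr-FR",
    utterance="réveille-moi à neuf heures vendredi",
    intent="alarm_set",
    slot_tags=["O", "O", "B-time", "I-time", "B-date"],
)
tokens = ex.utterance.split()
print(ex.intent, bio_to_slots(tokens, ex.slot_tags))
# → alarm_set [('time', 'neuf heures'), ('date', 'vendredi')]
```

An SLU model is evaluated on predicting both the intent label and these slot spans from the spoken utterance.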

Why it matters?

This research is significant because it fills a major gap in the availability of multilingual speech datasets. By providing a comprehensive resource for training AI systems, Speech-MASSIVE can help improve the performance of voice assistants and other applications that rely on understanding spoken language across different cultures and languages.

Abstract

We present Speech-MASSIVE, a multilingual Spoken Language Understanding (SLU) dataset comprising the speech counterpart for a portion of the MASSIVE textual corpus. Speech-MASSIVE covers 12 languages from different families and inherits from MASSIVE the annotations for the intent prediction and slot-filling tasks. Our extension is prompted by the scarcity of massively multilingual SLU datasets and the growing need for versatile speech datasets to assess foundation models (LLMs, speech encoders) across languages and tasks. We provide a multimodal, multitask, multilingual dataset and report SLU baselines using both cascaded and end-to-end architectures in various training scenarios (zero-shot, few-shot, and full fine-tune). Furthermore, we demonstrate the suitability of Speech-MASSIVE for benchmarking other tasks such as speech transcription, language identification, and speech translation. The dataset, models, and code are publicly available at: https://github.com/hlt-mt/Speech-MASSIVE
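The abstract contrasts cascaded and end-to-end SLU baselines. The sketch below shows the structural difference only; every model call is a placeholder returning canned output, not the paper's actual systems.

```python
# Schematic contrast of the two SLU architectures. All "models" here are
# placeholders with hard-coded outputs, purely to show the data flow.

def asr(audio):
    """Placeholder ASR stage: a real cascade would run a speech recognizer here."""
    return "wake me up at nine am on friday"

def nlu(text):
    """Placeholder text-based NLU stage: intent classification + slot filling."""
    return {"intent": "alarm_set", "slots": {"time": "nine am", "date": "friday"}}

def cascaded_slu(audio):
    # Cascade: transcribe first, then run text understanding on the transcript.
    # Any ASR transcription errors propagate into the NLU stage.
    return nlu(asr(audio))

def end_to_end_slu(audio):
    # End-to-end: one model maps audio directly to intent and slots,
    # with no intermediate transcript (canned output in this sketch).
    return {"intent": "alarm_set", "slots": {"time": "nine am", "date": "friday"}}

print(cascaded_slu(b"..."))
print(end_to_end_slu(b"..."))
```

Because Speech-MASSIVE carries transcripts as well as SLU labels, both pipeline styles (and the ASR stage on its own) can be trained and evaluated on the same data.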