YouTube-SL-25: A Large-Scale, Open-Domain Multilingual Sign Language Parallel Corpus
Garrett Tanzer, Biao Zhang
2024-07-17

Summary
This paper introduces YouTube-SL-25, a large, diverse, multilingual dataset of sign language videos with well-aligned captions, intended to advance machine learning research on sign languages.
What's the problem?
Machine learning research on sign languages, even comparatively well-studied ones like American Sign Language (ASL), is bottlenecked by a lack of data. The problem is even more severe for the many other sign languages used by Deaf and Hard of Hearing communities worldwide. Without enough parallel data, it is hard to train models that understand or translate sign languages accurately.
What's the solution?
To address this, the authors built YouTube-SL-25, a corpus of more than 3,000 hours of captioned video spanning more than 25 sign languages. It is over three times the size of YouTube-ASL, the previous largest dataset for ASL, making it the largest parallel sign language dataset to date and the first or largest parallel dataset for many of the included sign languages. The authors also report baselines for sign-to-text translation using a single unified multilingual multitask model based on T5, showing that multilingual transfer benefits both higher- and lower-resource sign languages.
Why it matters?
This research is important because it provides a resource that can significantly advance sign language technology. By making this large dataset available, it opens up opportunities for better communication tools for Deaf and Hard of Hearing individuals, improving accessibility and inclusivity in domains such as education, entertainment, and everyday interactions.
Abstract
Even for better-studied sign languages like American Sign Language (ASL), data is the bottleneck for machine learning research. The situation is worse yet for the many other sign languages used by Deaf/Hard of Hearing communities around the world. In this paper, we present YouTube-SL-25, a large-scale, open-domain multilingual corpus of sign language videos with seemingly well-aligned captions drawn from YouTube. With >3000 hours of videos across >25 sign languages, YouTube-SL-25 is a) >3x the size of YouTube-ASL, b) the largest parallel sign language dataset to date, and c) the first or largest parallel dataset for many of its component languages. We provide baselines for sign-to-text tasks using a unified multilingual multitask model based on T5 and report scores on benchmarks across 4 sign languages. The results demonstrate that multilingual transfer benefits both higher- and lower-resource sign languages within YouTube-SL-25.
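To make the modeling approach in the abstract more concrete, below is a minimal sketch (not the authors' released code) of a unified multilingual multitask sign-to-text model built on T5. It assumes that per-frame sign features (e.g., pose landmarks from some upstream estimator) are linearly projected into the T5 encoder's embedding space, and that a learned tag embedding signals the target task and language; the feature dimension, tag scheme, class names, and checkpoint are illustrative assumptions rather than the paper's actual configuration.

```python
# A minimal sketch of a T5-based multilingual multitask sign-to-text
# model. Assumptions (not from the paper): 255-dim per-frame features,
# a learned tag embedding per (task, language) pair, and "t5-base".
import torch
import torch.nn as nn
from transformers import T5ForConditionalGeneration, T5Tokenizer


class SignToText(nn.Module):
    def __init__(self, feature_dim=255, num_tags=30, model_name="t5-base"):
        super().__init__()
        self.t5 = T5ForConditionalGeneration.from_pretrained(model_name)
        d_model = self.t5.config.d_model
        # Map per-frame sign features into the space of T5 token embeddings.
        self.proj = nn.Linear(feature_dim, d_model)
        # One learned tag per (task, language) pair, prepended to the input.
        self.tag = nn.Embedding(num_tags, d_model)

    def embed(self, frames, tag_ids):
        # frames: (batch, num_frames, feature_dim); tag_ids: (batch,)
        tag = self.tag(tag_ids).unsqueeze(1)           # (batch, 1, d_model)
        return torch.cat([tag, self.proj(frames)], 1)  # (batch, 1+T, d_model)

    def forward(self, frames, tag_ids, labels=None):
        return self.t5(inputs_embeds=self.embed(frames, tag_ids), labels=labels)


tokenizer = T5Tokenizer.from_pretrained("t5-base")
model = SignToText()

# Toy training step: 64 frames of 255-dim features, one caption target.
frames = torch.randn(1, 64, 255)
tag_ids = torch.tensor([0])  # e.g., "translate ASL to English"
labels = tokenizer("a toy caption", return_tensors="pt").input_ids
loss = model(frames, tag_ids, labels=labels).loss

# Toy inference: decode text from the projected frame embeddings.
out = model.t5.generate(inputs_embeds=model.embed(frames, tag_ids), max_length=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```

The key design point this sketch illustrates is that a text-to-text model like T5 can be repurposed for video input by replacing token embeddings with projected frame features, so one shared model can serve many sign languages; how the real features are extracted and how tasks are signaled in the paper may differ from the assumptions here.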