GlotCC: An Open Broad-Coverage CommonCrawl Corpus and Pipeline for Minority Languages

Amir Hossein Kargaran, François Yvon, Hinrich Schütze

2024-11-01

Summary

This paper introduces GlotCC, a new text dataset that focuses on minority languages. It aims to provide a large, clean, and trustworthy collection of text data to help improve language models for languages that are often overlooked.

What's the problem?

Many existing text datasets mainly cover widely spoken languages, leaving minority languages with little or no representation. This lack of data makes it difficult for researchers to develop language models that can understand and work with these less common languages.

What's the solution?

GlotCC is a 2TB, document-level, general-domain corpus derived from CommonCrawl that covers more than 1,000 languages. The authors built an open-source, reproducible pipeline to generate it and to filter out noisy text. They also released the components of that pipeline, including the language identification model and the filters, so that the research community can use or build upon their work.

Why does it matter?

This research is important because it helps fill the gap in language resources for minority languages. By providing a reliable dataset, GlotCC enables better training of language models that can understand and generate text in these languages, ultimately supporting cultural preservation and communication in diverse communities.

Abstract

The need for large text corpora has increased with the advent of pretrained language models and, in particular, the discovery of scaling laws for these models. Most available corpora have sufficient data only for languages with large dominant communities. However, there is no corpus available that (i) covers a wide range of minority languages; (ii) is generated by an open-source reproducible pipeline; and (iii) is rigorously cleaned from noise, making it trustworthy to use. We present GlotCC, a clean, document-level, 2TB general domain corpus derived from CommonCrawl, covering more than 1000 languages. We make GlotCC and the system used to generate it - including the pipeline, language identification model, and filters - available to the research community. Corpus v. 1.0 https://huggingface.co/datasets/cis-lmu/GlotCC-v1, Pipeline v. 3.0 https://github.com/cisnlp/GlotCC.
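The abstract names three components: a pipeline, a language identification model, and filters. As a rough illustration of how such pieces can fit together, here is a minimal sketch in Python. It is not the GlotCC pipeline: the function names (`toy_identify`, `looks_clean`, `build_corpus`), the stopword lists, the language labels, and the thresholds are all hypothetical and chosen only to keep the example self-contained.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple

# Toy data standing in for a real language identification model (assumption).
TOY_STOPWORDS = {
    "eng-Latn": {"the", "and", "is", "of"},
    "deu-Latn": {"der", "und", "ist", "das"},
}


def toy_identify(text: str) -> Tuple[str, float]:
    """Return (label, confidence) as the share of words found in a stopword list."""
    words = text.lower().split()
    if not words:
        return ("unknown", 0.0)
    best_label, best_score = "unknown", 0.0
    for label, stopwords in TOY_STOPWORDS.items():
        score = sum(w in stopwords for w in words) / len(words)
        if score > best_score:
            best_label, best_score = label, score
    return (best_label, best_score)


def looks_clean(text: str, min_words: int = 5, min_alpha_ratio: float = 0.7) -> bool:
    """Hypothetical noise filter: reject short or symbol-heavy documents."""
    if len(text.split()) < min_words:
        return False
    alpha = sum(ch.isalpha() or ch.isspace() for ch in text)
    return alpha / len(text) >= min_alpha_ratio


def build_corpus(
    docs: Iterable[str],
    identify: Callable[[str], Tuple[str, float]] = toy_identify,
    min_confidence: float = 0.2,
) -> Dict[str, List[str]]:
    """Group documents by language label, dropping low-confidence or noisy ones."""
    corpus: Dict[str, List[str]] = defaultdict(list)
    for doc in docs:
        label, confidence = identify(doc)
        if confidence < min_confidence:
            continue  # language label is not trustworthy enough
        if not looks_clean(doc):
            continue  # document fails the noise filter
        corpus[label].append(doc)
    return dict(corpus)


docs = [
    "The cat is on the roof of the house and sleeps",
    "Der Hund ist im Garten und das Wetter ist gut",
    "$$$ 1234 #### 5678 the and is of %%%%",
    "xyz qrs tuv wxy zab cde",
]
corpus = build_corpus(docs)
```

In this sketch the third document is dropped by the noise filter and the fourth by the confidence threshold, so only the first two survive. A real pipeline would replace `toy_identify` with a trained language identification model and use far more careful filters.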
