The German Commons - 154 Billion Tokens of Openly Licensed Text for German Language Models

Lukas Gienapp, Christopher Schröder, Stefan Schweter, Christopher Akiki, Ferdinand Schlatt, Arden Zimmermann, Phillipe Genêt, Martin Potthast

2025-10-17

Summary

This paper introduces a large, openly licensed collection of German text data called the German Commons, designed to help build better German language models.

What's the problem?

Developing powerful language models requires massive amounts of text data for training, but much of the existing data isn't clearly licensed, meaning it's legally risky to use for creating and sharing models. This is especially true for languages other than English, where finding freely usable text is very difficult, hindering the creation of open-source language tools for those languages.

What's the solution?

The researchers created the German Commons by systematically gathering text from 41 sources spanning seven domains, including law, science, culture, politics, news, economics, and the web. They verified that every source carries a license permitting free use, modification, and redistribution, with CC-BY-SA 4.0 or an equivalent as the minimum. They then ran the data through a processing pipeline that applies quality filtering, removes duplicates, and fixes text formatting issues, yielding a dataset of 154.56 billion tokens. They also released the code used to build and filter the dataset so others can reproduce and extend their work.
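To make the pipeline idea concrete, here is a minimal sketch of the deduplication-plus-filtering step in Python. This is not the authors' released code; the function names, the hash-based exact-match deduplication, and the toy word-count quality filter are all illustrative assumptions (real pipelines typically also use fuzzy/near-duplicate detection and richer quality heuristics).

```python
import hashlib
import re

def normalize(text: str) -> str:
    # Lowercase and collapse whitespace so trivial formatting
    # differences do not defeat exact-match deduplication.
    return re.sub(r"\s+", " ", text.strip().lower())

def quality_ok(text: str, min_words: int = 5) -> bool:
    # Toy quality filter: keep only documents with a minimum
    # word count (a stand-in for real quality heuristics).
    return len(text.split()) >= min_words

def dedup_and_filter(docs: list[str]) -> list[str]:
    # Drop low-quality documents, then keep only the first
    # occurrence of each normalized document.
    seen: set[str] = set()
    kept: list[str] = []
    for doc in docs:
        if not quality_ok(doc):
            continue
        digest = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if digest in seen:
            continue
        seen.add(digest)
        kept.append(doc)
    return kept

docs = [
    "Ein Beispieltext mit genug Wörtern für den Filter.",
    "ein   Beispieltext mit genug Wörtern für den Filter.",  # near-verbatim duplicate
    "zu kurz",  # fails the word-count filter
]
print(len(dedup_and_filter(docs)))  # → 1
```

In practice, corpus pipelines at this scale shard the data and hash each shard in parallel; the single-pass set shown here just illustrates the logic.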

Why it matters?

The German Commons is important because it provides a crucial resource for developing truly open and accessible German language models. Before this, it was hard to build these models without risking legal issues. Now, researchers and developers can freely use this data to create new tools and applications in German, fostering innovation and wider access to language technology.

Abstract

Large language model development relies on large-scale training corpora, yet most contain data of unclear licensing status, limiting the development of truly open models. This problem is exacerbated for non-English languages, where openly licensed text remains critically scarce. We introduce the German Commons, the largest collection of openly licensed German text to date. It compiles data from 41 sources across seven domains, encompassing legal, scientific, cultural, political, news, economic, and web text. Through systematic sourcing from established data providers with verifiable licensing, it yields 154.56 billion tokens of high-quality text for language model training. Our processing pipeline implements comprehensive quality filtering, deduplication, and text formatting fixes, ensuring consistent quality across heterogeneous text sources. All domain subsets feature licenses of at least CC-BY-SA 4.0 or equivalent, ensuring legal compliance for model training and redistribution. The German Commons therefore addresses the critical gap in openly licensed German pretraining data, and enables the development of truly open German language models. We also release code for corpus construction and data filtering tailored to German language text, rendering the German Commons fully reproducible and extensible.