German4All - A Dataset and Model for Readability-Controlled Paraphrasing in German

Miriam Anschütz, Thanh Mai Pham, Eslam Nasrallah, Maximilian Müller, Cristian-George Craciun, Georg Groh

2025-08-26

Summary

This paper introduces a new dataset and model designed to make German text easier to understand for different audiences, essentially creating different versions of the same text tailored to various reading levels.

What's the problem?

It's hard to automatically produce several versions of a text that all say the same thing at different levels of complexity. German lacked a large-scale resource of this kind, which made it difficult to build tools that adapt text for people with varying reading abilities or language skills.

What's the solution?

The researchers created a large dataset called German4All, containing over 25,000 paragraph-level samples of German text, each aligned across five readability levels. They generated the different versions automatically with GPT-4 and then checked the quality of the dataset with both human and LLM-based judgments. Finally, they used the dataset to train an open-source model that paraphrases German text to a chosen readability level and reaches state-of-the-art performance in German text simplification.
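To make the idea of aligned, multi-level paraphrases concrete, here is a minimal sketch of how one such sample could be represented and turned into training pairs. The `AlignedSample` schema, the `make_training_pairs` helper, and the `<level_k>` control prefix are all hypothetical illustrations, not the dataset's actual format or the authors' training setup.

```python
from dataclasses import dataclass
from itertools import permutations
from typing import Dict, List, Tuple

NUM_LEVELS = 5  # the dataset spans five readability levels


@dataclass
class AlignedSample:
    """One paragraph with one paraphrase per readability level (hypothetical schema)."""
    versions: Dict[int, str]  # readability level (1..NUM_LEVELS) -> text


def make_training_pairs(sample: AlignedSample) -> List[Tuple[str, str]]:
    """Build (input, target) pairs for every ordered pair of distinct levels.

    The '<level_k>' prefix telling the model which level to produce is an
    assumption made for illustration only.
    """
    pairs = []
    for src, tgt in permutations(sorted(sample.versions), 2):
        model_input = f"<level_{tgt}> {sample.versions[src]}"
        pairs.append((model_input, sample.versions[tgt]))
    return pairs


# Placeholder texts stand in for real German paragraphs.
sample = AlignedSample({lvl: f"text at level {lvl}" for lvl in range(1, NUM_LEVELS + 1)})
pairs = make_training_pairs(sample)
print(len(pairs))  # 5 source levels x 4 other target levels
```

Because every version of a paragraph is aligned with every other, a single sample can supply examples in both directions, simplifying as well as making text more complex.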

Why does it matter?

This work is important because it provides the tools to make information more accessible to a wider range of people. By being able to automatically adjust the complexity of text, we can help people who are learning German, have reading difficulties, or simply prefer information presented in a simpler way. The dataset and model are also released publicly, allowing other researchers to build upon this work and improve text simplification technology.

Abstract

The ability to paraphrase texts across different complexity levels is essential for creating accessible texts that can be tailored toward diverse reader groups. Thus, we introduce German4All, the first large-scale German dataset of aligned readability-controlled, paragraph-level paraphrases. It spans five readability levels and comprises over 25,000 samples. The dataset is automatically synthesized using GPT-4 and rigorously evaluated through both human and LLM-based judgments. Using German4All, we train an open-source, readability-controlled paraphrasing model that achieves state-of-the-art performance in German text simplification, enabling more nuanced and reader-specific adaptations. We open-source both the dataset and the model to encourage further research on multi-level paraphrasing.