Judging Quality Across Languages: A Multilingual Approach to Pretraining Data Filtering with Language Models

Mehdi Ali, Manuel Brack, Max Lübbering, Elias Wendt, Abbas Goher Khan, Richard Rutmann, Alex Jude, Maurice Kraus, Alexander Arno Weber, Felix Stollenwerk, David Kaczér, Florian Mai, Lucie Flek, Rafet Sifa, Nicolas Flores-Herr, Joachim Köhler, Patrick Schramowski, Michael Fromm, Kristian Kersting

2025-05-29

Summary

This paper introduces JQL (Judging Quality across Languages), a method for selecting high-quality pretraining data across many languages so that AI models learn more effectively.

What's the problem?

When training AI models to understand and use multiple languages, the quality of the available training data varies widely. Older heuristic methods for filtering out low-quality data are not very accurate, so bad data slips through and hurts how well the models learn.

What's the solution?

To solve this, the researchers used strong language models that already understand multiple languages to judge training documents, then scored and kept only the highest-quality examples using pretrained multilingual embeddings. This systematic approach works better than simple rule-based filters and helps build stronger, more reliable models.
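The general idea can be illustrated with a toy sketch: embed each document, predict a quality score with a lightweight head, and keep only documents above a threshold. This is not the authors' implementation; `embed` is a deterministic placeholder for a real pretrained multilingual encoder, and the scoring weights are random stand-ins for a head that would, in practice, be trained on quality judgments from language models.

```python
import hashlib
import numpy as np

EMBED_DIM = 8

def embed(text: str) -> np.ndarray:
    """Toy deterministic 'embedding' (placeholder for a real
    pretrained multilingual encoder)."""
    digest = hashlib.sha256(text.encode("utf-8")).digest()
    seed = int.from_bytes(digest[:4], "big")
    return np.random.default_rng(seed).normal(size=EMBED_DIM)

# Lightweight quality head: here random weights; in practice it would be
# trained to imitate quality judgments produced by strong LLM judges.
rng = np.random.default_rng(0)
W = rng.normal(size=EMBED_DIM)

def quality_score(text: str) -> float:
    """Sigmoid of a linear probe over the embedding -> score in (0, 1)."""
    z = float(embed(text) @ W)
    return 1.0 / (1.0 + np.exp(-z))

def filter_corpus(docs: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only documents whose predicted quality meets the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]

docs = [
    "A well-written encyclopedic article about astronomy.",
    "click here click here click here",
    "Ein sorgfältig geschriebener Absatz über Geschichte.",
]
kept = filter_corpus(docs)
```

Because the scorer operates on language-agnostic embeddings, the same filtering threshold can in principle be applied across many languages at once, which is the appeal of this kind of multilingual pipeline.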

Why it matters?

This is important because it leads to smarter AI that can understand and work with many languages, making technology more accessible and useful for people all around the world.

Abstract

JQL systematically curates high-quality multilingual training data using pretrained multilingual embeddings, outperforming heuristic methods and improving downstream model training across diverse languages.