Gaperon: A Peppered English-French Generative Language Model Suite

Nathan Godey, Wissam Antoun, Rian Touchent, Rachel Bawden, Éric de la Clergerie, Benoît Sagot, Djamé Seddah

2025-10-30

Summary

This paper introduces Gaperon, a suite of freely available language models that understand and generate both French and English, as well as code. The researchers released not just the final models but *everything* needed to recreate their work, aiming for complete transparency in how these large AI systems are built.

What's the problem?

Building really good AI language models requires massive amounts of data and computing power, yet it's often unclear exactly *how* these models are trained, what data they use, and how those choices affect their performance and potential biases. There's also a tricky balance between making the training data 'clean' and high quality, and accidentally including data that lets the model cheat on benchmark tests, or unintentionally making the model unsafe. On top of that, researchers lack realistic ways to test how vulnerable these models are to malicious data attacks.

What's the solution?

The researchers created Gaperon models in three sizes (1.5 billion, 8 billion, and 24 billion parameters) and trained them on 2 to 4 trillion tokens of text. Crucially, they released not only the models themselves, but also the datasets they used (filtered for quality with a neural classifier), the training code, and hundreds of intermediate checkpoints taken during training. They experimented with different data filtering techniques and deliberately continued training on data that includes benchmark test sets, to measure how this 'contamination' affects performance. They also intentionally introduced some harmless 'poisoning' of the training data to create a realistic testbed for studying model safety.
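The quality-filtering step described above can be pictured as scoring each document with a classifier and keeping only those above a threshold. The sketch below is purely illustrative: the `quality_score` function is a toy stand-in (it rewards longer, punctuation-terminated text), not Gaperon's actual neural classifier, and the threshold value is an assumption.

```python
# Hypothetical sketch of score-based corpus filtering. The scoring function
# here is a toy proxy, NOT the neural quality classifier used in the paper.

def quality_score(text: str) -> float:
    """Toy quality proxy: longer, sentence-terminated text scores higher.
    A real pipeline would use a trained neural classifier instead."""
    if not text.strip():
        return 0.0
    word_count = len(text.split())
    ends_cleanly = text.rstrip().endswith((".", "!", "?"))
    return min(word_count / 20.0, 1.0) * (1.0 if ends_cleanly else 0.5)

def filter_corpus(docs: list[str], threshold: float = 0.5) -> list[str]:
    """Keep only documents whose quality score meets the threshold."""
    return [doc for doc in docs if quality_score(doc) >= threshold]

docs = [
    "Short frag",  # low score: too short, no terminal punctuation
    "This is a reasonably long, well-formed sentence that a quality "
    "filter might keep because it reads as fluent and complete text.",
]
kept = filter_corpus(docs)  # only the second document survives
```

Note the trade-off the paper highlights: a filter like this can improve fluency while also systematically favoring the polished, benchmark-like text that causes leakage, which is why the authors study filtering and contamination together.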

Why it matters?

Gaperon is important because it provides a fully open and reproducible platform for studying large language models. This means other researchers can examine every step of the process, verify the results, and build upon this work. It helps us understand the trade-offs between data quality, performance on tests, safety, and making these powerful AI tools openly available to everyone.

Abstract

We release Gaperon, a fully open suite of French-English-coding language models designed to advance transparency and reproducibility in large-scale model training. The Gaperon family includes 1.5B, 8B, and 24B parameter models trained on 2-4 trillion tokens, released with all elements of the training pipeline: French and English datasets filtered with a neural quality classifier, an efficient data curation and training framework, and hundreds of intermediate checkpoints. Through this work, we study how data filtering and contamination interact to shape both benchmark and generative performance. We find that filtering for linguistic quality enhances text fluency and coherence but yields subpar benchmark results, and that late deliberate contamination -- continuing training on data mixes that include test sets -- recovers competitive scores while only reasonably harming generation quality. We discuss how usual neural filtering can unintentionally amplify benchmark leakage. To support further research, we also introduce harmless data poisoning during pretraining, providing a realistic testbed for safety studies. By openly releasing all models, datasets, code, and checkpoints, Gaperon establishes a reproducible foundation for exploring the trade-offs between data curation, evaluation, safety, and openness in multilingual language model development.