Apertus: Democratizing Open and Compliant LLMs for Global Language Environments
Alejandro Hernández-Cano, Alexander Hägele, Allen Hao Huang, Angelika Romanou, Antoni-Joan Solergibert, Barna Pasztor, Bettina Messmer, Dhia Garbaya, Eduard Frank Ďurech, Ido Hakimi, Juan García Giraldo, Mete Ismayilzada, Negar Foroutan, Skander Moalla, Tiancheng Chen, Vinko Sabolčec, Yixuan Xu, Michael Aerni, Badr AlKhamissi, Ines Altemir Marinas, Mohammad Hossein Amani, Matin Ansaripour
2025-09-19
Summary
This paper introduces Apertus, a suite of large language models that are fully open and available to everyone, built with a focus on responsible data practices and inclusive multilingual coverage.
What's the problem?
Many open-source large language models have problems with how their training data is collected and used: they often don't disclose where the data came from, may disregard content owners' rights or copyright, and can include harmful or private content. In addition, most models are heavily focused on English and perform poorly in other languages, limiting their usefulness for a global audience.
What's the solution?
The creators of Apertus addressed these problems by pretraining only on openly available data, filtering it to remove toxic, non-permissive, and personally identifiable content, and honoring websites' robots.txt opt-outs retroactively, so pages whose owners have opted out of crawling are excluded even from older web snapshots. They also trained with the Goldfish objective, which prevents the model from memorizing and reproducing training text verbatim while preserving downstream task performance. Finally, they pretrained Apertus on 15 trillion tokens spanning more than 1800 languages, with roughly 40% of the data being non-English, to strengthen its multilingual capabilities.
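The core idea of the Goldfish objective is that a deterministic, pseudorandom subset of token positions is excluded from the next-token loss, so the model never receives a gradient pushing it toward reproducing those exact tokens, which breaks verbatim recall of long passages. The sketch below is a minimal illustration, not the paper's implementation: the hash function, the context-window width `h`, and the drop rate `1/k` are assumptions chosen for readability.

```python
import hashlib
import math

def goldfish_keep_mask(token_ids, k=4, h=13):
    """Return a per-position mask: True = token counts toward the loss.

    Roughly 1/k of positions are dropped. The choice is made by hashing
    the h preceding token ids, so the same text always masks the same
    positions (deterministic, unlike random dropout). Values of k and h
    here are illustrative defaults, not the paper's settings.
    """
    keep = [True] * len(token_ids)
    for i in range(h, len(token_ids)):
        window = tuple(token_ids[i - h:i])
        digest = hashlib.sha256(repr(window).encode()).digest()
        if int.from_bytes(digest[:8], "big") % k == 0:
            keep[i] = False  # no gradient signal for this token
    return keep

def masked_nll(token_probs, token_ids, k=4, h=13):
    """Mean negative log-likelihood over only the kept positions.

    token_probs[i] is the model's probability for the true token at
    position i (a stand-in for real model outputs).
    """
    keep = goldfish_keep_mask(token_ids, k=k, h=h)
    kept = [-math.log(p) for p, m in zip(token_probs, keep) if m]
    return sum(kept) / len(kept)
```

Because the mask is derived from the text itself, repeated exposure to the same document drops the same tokens every epoch, so the model can never close the gap on those positions and cannot regenerate the passage token-for-token.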
Why it matters?
Apertus is important because it provides a truly open and ethical alternative to existing large language models. By releasing not just the model weights but also the code, data pipelines, checkpoints, and evaluation suites used to create them, the authors let researchers and developers fully understand, audit, and build on the work. Its strong multilingual support also makes it more accessible and useful for people around the world, and it sets a new standard for responsible AI development.
Abstract
We present Apertus, a fully open suite of large language models (LLMs) designed to address two systemic shortcomings in today's open model ecosystem: data compliance and multilingual representation. Unlike many prior models that release weights without reproducible data pipelines or regard for content-owner rights, Apertus models are pretrained exclusively on openly available data, retroactively respecting robots.txt exclusions and filtering for non-permissive, toxic, and personally identifiable content. To mitigate risks of memorization, we adopt the Goldfish objective during pretraining, strongly suppressing verbatim recall of data while retaining downstream task performance. The Apertus models also expand multilingual coverage, training on 15T tokens from over 1800 languages, with ~40% of pretraining data allocated to non-English content. Released at 8B and 70B scales, Apertus approaches state-of-the-art results among fully open models on multilingual benchmarks, rivalling or surpassing open-weight counterparts. Beyond model weights, we release all scientific artifacts from our development cycle with a permissive license, including data preparation scripts, checkpoints, evaluation suites, and training code, enabling transparent audit and extension.
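The retroactive robots.txt compliance described in the abstract amounts to re-checking pages already present in a web dump against a site's current robots.txt and discarding any that are now disallowed. A minimal sketch of that check using Python's standard library follows; the agent name `ApertusBot` is illustrative, not the project's actual crawler token.

```python
from urllib import robotparser

def is_permitted(robots_txt: str, url: str, agent: str = "ApertusBot") -> bool:
    """Decide from a site's robots.txt text whether `agent` may use `url`.

    In a retroactive-compliance pass, `robots_txt` would be the freshly
    fetched rules for the page's host, and pages for which this returns
    False are dropped from the training corpus even if they were crawled
    long before the rules changed.
    """
    rp = robotparser.RobotFileParser()
    rp.parse(robots_txt.splitlines())
    return rp.can_fetch(agent, url)

# Example rules: everything under /private/ is off-limits to all agents.
rules = "User-agent: *\nDisallow: /private/\n"
is_permitted(rules, "https://example.com/public/page.html")   # kept
is_permitted(rules, "https://example.com/private/page.html")  # dropped
```

Applying this per-host against current robots.txt files, rather than the rules that were live at crawl time, is what makes the filtering "retroactive": content owners who opt out today are removed from the corpus even for historical snapshots.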