Llama-GENBA-10B: A Trilingual Large Language Model for German, English and Bavarian

Michael Hoffmann, Jophin John, Stefan Schweter, Gokul Ramakrishnan, Hoi-Fong Mak, Alice Zhang, Dmitry Gaynullin, Nicolay J. Hammer

2025-09-09

Summary

This paper introduces Llama-GENBA-10B, a new language model designed to work well in English, German, and a less common language called Bavarian. It aims to fix the problem where many large language models are heavily biased towards English and don't perform as well in other languages.

What's the problem?

Most large language models are trained primarily on English text, which means they often struggle with other languages, especially those with fewer digital resources like Bavarian. This creates a bias where the model understands and performs better in English, limiting its usefulness for people who speak other languages. Building a model that fairly represents multiple languages, particularly a low-resource one like Bavarian, is a significant challenge.

What's the solution?

The researchers started with an existing model called Llama 3.1-8B and expanded it to 10 billion parameters, naming it Llama-GENBA-10B. They then continued pretraining it on 164 billion tokens of text (82 billion English, 82 billion German, and 80 million Bavarian), balancing the mix so that English would not dominate. Along the way they overcame several hurdles: finding enough Bavarian text, building a single tokenizer that handles all three languages, and tuning the architecture and language ratios for good cross-lingual performance. They also created a new evaluation suite by translating existing German benchmarks into Bavarian, so the model's abilities could be measured accurately in all three languages.
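The balancing step above can be sketched in code. This is a minimal illustration, not the paper's actual pipeline: the per-language token budgets come from the abstract (82B/82B/80M), but the raw Bavarian corpus size used here is a hypothetical placeholder, and the helper functions are invented for illustration.

```python
# Hedged sketch of a trilingual data-mixture calculation.
# Token budgets follow the paper's reported 164B-token mix;
# the raw corpus size for Bavarian below is a made-up placeholder.

TOKEN_BUDGET = {
    "english": 82_000_000_000,
    "german": 82_000_000_000,
    "bavarian": 80_000_000,
}

def sampling_weights(budget):
    """Fraction of the training stream drawn from each language."""
    total = sum(budget.values())
    return {lang: n / total for lang, n in budget.items()}

def upsampling_factor(budget_tokens, raw_corpus_tokens):
    """How many passes over a raw corpus are needed to meet its budget."""
    return budget_tokens / raw_corpus_tokens

weights = sampling_weights(TOKEN_BUDGET)

# If the raw Bavarian corpus held, say, 20M tokens (placeholder),
# it would be repeated ~4 times to reach its 80M-token budget:
bavarian_epochs = upsampling_factor(TOKEN_BUDGET["bavarian"], 20_000_000)
```

Equal English/German budgets keep either language from dominating, while the low-resource language is repeated over multiple epochs to reach its (much smaller) budget.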

Why it matters?

This work is important because it demonstrates how to build a language model that is more inclusive and works well for a wider range of languages. By specifically focusing on Bavarian, a language that doesn't have a lot of digital resources, the researchers provide a model and a process that can be used to support other low-resource languages. It also shows that it’s possible to create powerful multilingual models efficiently, and provides data on the energy used during training, which is important for sustainable AI development.

Abstract

We present Llama-GENBA-10B, a trilingual foundation model addressing English-centric bias in large language models. Built on Llama 3.1-8B and scaled to 10B parameters, Llama-GENBA-10B is continuously pretrained on 164B tokens (82B English, 82B German, and 80M Bavarian), balancing resources while preventing English dominance. Targeted at the German NLP community, the model also promotes Bavarian as a low-resource language. Development tackled four challenges: (1) curating a multilingual corpus despite Bavarian scarcity, (2) creating a unified tokenizer for English, German, and Bavarian, (3) optimizing architecture and language-ratio hyperparameters for cross-lingual transfer, and (4) establishing the first standardized trilingual evaluation suite by translating German benchmarks into Bavarian. Evaluations show that Llama-GENBA-10B achieves strong cross-lingual performance, with the fine-tuned variant surpassing Apertus-8B-2509 and gemma-2-9b in Bavarian and establishing itself as the best model in its class for this language, while also outperforming EuroLLM in English and matching its results in German. Training on the Cerebras CS-2 demonstrated efficient large-scale multilingual pretraining with documented energy use, offering a blueprint for inclusive foundation models that integrate low-resource languages.