Fietje: An open, efficient LLM for Dutch
Bram Vanroy
2024-12-23

Summary
This paper introduces Fietje, a new family of small language models designed specifically for Dutch. Built on the larger Phi 2 architecture, it aims to provide an efficient and effective tool for processing Dutch text.
What's the problem?
Many existing language models focus primarily on English, leaving speakers of other languages, such as Dutch, with fewer options. Established models also tend to be large and to require significant computing power, making them hard to use for everyday tasks or for researchers without access to expensive resources.
What's the solution?
Fietje addresses this problem with a smaller model of 2.7 billion parameters trained specifically on Dutch, using a large dataset of 28 billion Dutch tokens to learn the language well. The model is open-source, meaning anyone can access its code and data, which makes it easier to work with Dutch language technology. Fietje has shown competitive performance on various tasks compared to larger models, demonstrating that smaller models can still be very effective.
Why does it matter?
This research is important because it makes advanced language processing technology more accessible to Dutch speakers. By providing an efficient and open-source model, Fietje encourages more people to use and develop applications in Dutch, helping to promote linguistic diversity in AI technology.
Abstract
This paper introduces Fietje, a family of small language models (SLMs) specifically designed for the Dutch language. The model is based on Phi 2, an English-centric model of 2.7 billion parameters. Fietje demonstrated competitive results with larger language models upon its release. A core emphasis of this work is transparency and reproducibility: Fietje is fully open-source, with model weights, datasets, training, and evaluation code all publicly accessible. The paper discusses the performance of Fietje and many other models on an extensive evaluation suite of benchmarks on reasoning, sentiment analysis, world knowledge, linguistic acceptability and word sense disambiguation. Evaluation results illustrate the rapid progress in the field of LLMs, where recent small models outperform older, larger models that were fine-tuned for Dutch. This trend signals an exciting future for Dutch language processing, suggesting that even compact LLMs are becoming increasingly capable. Furthermore, ongoing and future efforts to adapt LLMs to Dutch are poised to enhance these models even further, broadening their applicability and accessibility. Fietje is only an intermediate step in improving accessibility to language technology for users of the Dutch language.