Meltemi: The first open Large Language Model for Greek
Leon Voukoutis, Dimitris Roussis, Georgios Paraskevopoulos, Sokratis Sofianopoulos, Prokopis Prokopidis, Vassilis Papavasileiou, Athanasios Katsamanis, Stelios Piperidis, Vassilis Katsouros
2024-07-31

Summary
This paper introduces Meltemi 7B, the first open Large Language Model (LLM) built specifically for the Greek language. The model understands and generates Greek text, making advanced language technology available for a wide range of Greek-language applications.
What's the problem?
Most existing language models focus on high-resource languages such as English, leaving lower-resourced languages like Greek with little support. Without dedicated models, speakers of these languages cannot take full advantage of AI tools that understand and generate their language effectively.
What's the solution?
To address this gap, the authors developed Meltemi 7B, a 7-billion-parameter model trained on a 40-billion-token Greek corpus. It was built on top of the Mistral 7B model and adapted to Greek through continual pretraining on this corpus. In addition, a chat-oriented variant, Meltemi 7B Instruct, was created by instruction-tuning the base model, with particular attention to making it follow instructions properly and avoid toxic content. Both models were evaluated on a broad set of tasks to demonstrate their effectiveness.
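To make the adaptation step concrete, here is a minimal sketch of continual pretraining with the Hugging Face transformers library: a pretrained Mistral 7B checkpoint is loaded and causal language modeling simply continues on Greek text. The dataset path and all hyperparameters below are illustrative assumptions, not the authors' actual training configuration.

```python
# Minimal sketch of continual pretraining: load Mistral 7B and keep
# training it as a causal language model on Greek text.
from datasets import load_dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

base = "mistralai/Mistral-7B-v0.1"
tokenizer = AutoTokenizer.from_pretrained(base)
tokenizer.pad_token = tokenizer.eos_token  # Mistral has no pad token by default
model = AutoModelForCausalLM.from_pretrained(base)

# Hypothetical Greek corpus stored as plain-text files.
raw = load_dataset("text", data_files={"train": "greek_corpus/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

train = raw["train"].map(tokenize, batched=True, remove_columns=["text"])

args = TrainingArguments(
    output_dir="meltemi-continual",
    per_device_train_batch_size=1,
    gradient_accumulation_steps=64,
    learning_rate=2e-5,       # illustrative; real runs tune this carefully
    num_train_epochs=1,
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=train,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

In practice a 7B continual-pretraining run of this kind is distributed across many GPUs; the single-process sketch above only shows the shape of the procedure.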
Why it matters?
This research matters because it delivers the first open LLM for Greek, broadening access to modern AI technology for Greek speakers. By improving the understanding and generation of Greek text, Meltemi can support applications such as education, translation, and content creation, helping to preserve and promote the Greek language in the digital age.
Abstract
We describe the development and capabilities of Meltemi 7B, the first open Large Language Model for the Greek language. Meltemi 7B has 7 billion parameters and is trained on a 40 billion token Greek corpus. It was developed by adapting Mistral through continual pretraining on this Greek corpus, and it incorporates information up to September 2023. Furthermore, we have translated and curated a Greek instruction corpus, which was used for the instruction-tuning of a chat model, named Meltemi 7B Instruct. Special care was given to alignment and the removal of toxic content for Meltemi 7B Instruct. The developed models are evaluated on a broad set of collected evaluation corpora, and examples of prompts and responses are presented. Both Meltemi 7B and Meltemi 7B Instruct are available at https://huggingface.co/ilsp under the Apache 2.0 license.
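Since both checkpoints are published on the Hugging Face Hub, they can be loaded with the standard transformers API. The sketch below assumes a repository id of ilsp/Meltemi-7B-Instruct-v1 under the linked ilsp organization; check the organization page for the exact released model names.

```python
# Minimal sketch: load the instruction-tuned chat model and generate a reply.
# The repository id is an assumption; see https://huggingface.co/ilsp for the
# exact released checkpoints.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "ilsp/Meltemi-7B-Instruct-v1"  # assumed repo id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo, torch_dtype=torch.bfloat16, device_map="auto"
)

# "What is the capital of Greece?" in Greek.
messages = [{"role": "user", "content": "Ποια είναι η πρωτεύουσα της Ελλάδας;"}]
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=128, do_sample=False)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```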