Adapting LLMs to Hebrew: Unveiling DictaLM 2.0 with Enhanced Vocabulary and Instruction Capabilities
Shaltiel Shmidman, Avi Shmidman, Amir DN Cohen, Moshe Koppel
2024-07-10

Summary
This paper introduces DictaLM 2.0 and DictaLM 2.0-Instruct, two large language models designed specifically for Hebrew. The models aim to improve how well AI systems can understand and generate Hebrew text by continuing training on a large bilingual corpus.
What's the problem?
The main problem is that training large language models (LLMs) for low-resource languages like Hebrew, which have far less available training data than languages like English, presents unique challenges. Existing models often struggle with Hebrew's specific linguistic features, making it hard for them to generate accurate and natural-sounding text.
What's the solution?
To address this issue, the authors developed DictaLM 2.0 and DictaLM 2.0-Instruct, which are based on the Mistral model and continue its training on roughly 200 billion tokens of Hebrew and English text. They applied specialized adaptation techniques, including enhancing the model's vocabulary for Hebrew, so that the model can learn the language's characteristics effectively. They then fine-tuned DictaLM 2.0-Instruct on a comprehensive instruction dataset to improve its ability to follow task-specific commands. The authors also created a new benchmark suite for Hebrew LLM evaluation, covering tasks such as question answering, sentiment analysis, the Winograd Schema Challenge, translation, and summarization.
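
To make this concrete, here is a minimal sketch of how such an instruction-tuned model might be queried with the Hugging Face transformers library. The model identifier, the presence of a chat template, and the example prompt are illustrative assumptions, not details confirmed by the paper.

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    # Model id assumed to follow the authors' Hugging Face naming; verify before use.
    model_id = "dicta-il/dictalm2.0-instruct"

    tokenizer = AutoTokenizer.from_pretrained(model_id)
    model = AutoModelForCausalLM.from_pretrained(
        model_id, torch_dtype=torch.bfloat16, device_map="auto"
    )

    # Hebrew prompt: "Translate to English: a good book is a good friend."
    messages = [{"role": "user", "content": "תרגם לאנגלית: ספר טוב הוא חבר טוב."}]
    input_ids = tokenizer.apply_chat_template(
        messages, add_generation_prompt=True, return_tensors="pt"
    ).to(model.device)

    output_ids = model.generate(input_ids, max_new_tokens=64, do_sample=False)
    print(tokenizer.decode(output_ids[0][input_ids.shape[-1]:], skip_special_tokens=True))

In practice, the exact prompt format depends on the chat template shipped with the released tokenizer.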
Why it matters?
This research is important because it helps improve AI's ability to work with low-resource languages like Hebrew. By developing better models for Hebrew, this work contributes to the broader field of multilingual natural language processing (NLP), making it easier for AI systems to understand and generate text in many different languages. This can lead to better communication tools, educational resources, and applications that support Hebrew speakers.
Abstract
Training large language models (LLMs) in low-resource languages such as Hebrew poses unique challenges. In this paper, we introduce DictaLM2.0 and DictaLM2.0-Instruct, two LLMs derived from the Mistral model, trained on a substantial corpus of approximately 200 billion tokens in both Hebrew and English. Adapting a pre-trained model to a new language involves specialized techniques that differ significantly from training a model from scratch or further training existing models on well-resourced languages such as English. We outline these novel training methodologies, which facilitate effective learning and adaptation to the linguistic properties of Hebrew. Additionally, we fine-tuned DictaLM2.0-Instruct on a comprehensive instruct dataset to enhance its performance on task-specific instructions. To rigorously evaluate our models, we introduce a new benchmark suite for Hebrew LLM evaluation, covering a diverse set of tasks including Question Answering, Sentiment Analysis, Winograd Schema Challenge, Translation, and Summarization. Our work not only addresses the intricacies of training LLMs in low-resource languages but also proposes a framework that can be leveraged for adapting other LLMs to various non-English languages, contributing to the broader field of multilingual NLP.
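
The "specialized techniques" mentioned in the abstract typically begin with an enlarged tokenizer whose new Hebrew tokens are grafted onto the base model before continued pretraining. The sketch below shows this general vocabulary-extension pattern in Hugging Face transformers; the base checkpoint name, the example Hebrew tokens, and the mean-initialization heuristic are assumptions for illustration, not the authors' exact procedure.

    import torch
    from transformers import AutoTokenizer, AutoModelForCausalLM

    # Base checkpoint assumed for illustration; the paper adapts a Mistral-family model.
    base_id = "mistralai/Mistral-7B-v0.1"
    tokenizer = AutoTokenizer.from_pretrained(base_id)
    model = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype=torch.bfloat16)

    # Hypothetical Hebrew subword tokens; in practice these would come from a
    # tokenizer trained on a large Hebrew corpus.
    new_tokens = ["שלום", "ירושלים", "ללמוד"]
    num_added = tokenizer.add_tokens(new_tokens)

    # Grow the embedding (and output) matrices so the new token ids have rows.
    model.resize_token_embeddings(len(tokenizer))

    # Common heuristic: initialize the new rows with the mean of the existing
    # embeddings rather than random values, so continued pretraining starts stably.
    if num_added > 0:
        with torch.no_grad():
            emb = model.get_input_embeddings().weight
            emb[-num_added:] = emb[:-num_added].mean(dim=0, keepdim=True)

After this step, the adapted model would be further pretrained on the bilingual Hebrew-English corpus and, for the Instruct variant, fine-tuned on instruction data.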