Knesset-DictaBERT: A Hebrew Language Model for Parliamentary Proceedings

Gili Goldin, Shuly Wintner

2024-07-31

Summary

This paper introduces Knesset-DictaBERT, a Hebrew language model specialized for the language of Israeli parliamentary proceedings. It is based on the DictaBERT architecture and fine-tuned on Knesset transcripts to better model parliamentary language.

What's the problem?

Most language models are not trained on the distinctive vocabulary and structure of parliamentary discussion, especially in Hebrew. Without a tailored model, it is difficult to accurately analyze and interpret the language used in political settings, which limits computational study of government discourse.

What's the solution?

To solve this problem, the authors fine-tuned DictaBERT on a large corpus of transcripts from the Knesset, Israel's parliament, continuing its masked language modeling (MLM) training so that it better captures the context and meaning of parliamentary debate. In their evaluation, the resulting model outperformed the baseline DictaBERT on masked-token prediction, achieving lower perplexity and higher accuracy. A sketch of this kind of fine-tuning setup is shown below.
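The summary does not include the authors' training code, but the setup it describes, continued MLM fine-tuning of a BERT-style model on Knesset transcripts, maps directly onto the Hugging Face transformers API. Here is a minimal sketch; the checkpoint name dicta-il/dictabert, the file knesset_sentences.txt, and all hyperparameters are illustrative assumptions, not details taken from the paper.

```python
# Minimal sketch of continued MLM fine-tuning, in the spirit of the paper.
# Checkpoint name, data path, and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (
    AutoModelForMaskedLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "dicta-il/dictabert"  # assumed Hub identifier for the base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Load plain-text Knesset transcripts, one sentence per line (path is illustrative).
dataset = load_dataset("text", data_files={"train": "knesset_sentences.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = dataset.map(tokenize, batched=True, remove_columns=["text"])

# Randomly mask 15% of tokens: the standard BERT masked language modeling objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="knesset-dictabert", num_train_epochs=1),
    train_dataset=tokenized["train"],
    data_collator=collator,
)
trainer.train()
```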

Why it matters?

This research is important because it provides a valuable tool for analyzing political discourse in Hebrew. By improving how well models understand parliamentary language, Knesset-DictaBERT can help policymakers, researchers, and the public engage more effectively with political discussions and decisions. It could also serve as a template for building domain-specific models in other underrepresented languages.

Abstract

We present Knesset-DictaBERT, a large Hebrew language model fine-tuned on the Knesset Corpus, which comprises Israeli parliamentary proceedings. The model is based on the DictaBERT architecture and demonstrates significant improvements in understanding parliamentary language on the masked language modeling (MLM) task. We provide a detailed evaluation of the model's performance, showing improvements in perplexity and accuracy over the baseline DictaBERT model.
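As a companion to the fine-tuning sketch above, the following is one common way to estimate MLM perplexity, exponentiating the mean masked-token loss over held-out text, so that a fine-tuned checkpoint can be compared against the baseline. The checkpoint name and the placeholder sentences are again assumptions for illustration, not the authors' evaluation code.

```python
# Sketch: pseudo-perplexity for an MLM as exp(mean masked-token loss).
# Swap model_name between the baseline and fine-tuned checkpoints to compare.
import math
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer, DataCollatorForLanguageModeling

model_name = "dicta-il/dictabert"  # assumed identifier; replace with the fine-tuned model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name).eval()
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

sentences = ["<held-out Knesset sentence 1>", "<held-out Knesset sentence 2>"]  # placeholders

losses = []
with torch.no_grad():
    for s in sentences:
        enc = tokenizer(s, return_tensors="pt", truncation=True, max_length=512)
        # The collator masks random tokens and builds labels for the masked positions.
        batch = collator([{k: v.squeeze(0) for k, v in enc.items()}])
        losses.append(model(**batch).loss.item())

print("pseudo-perplexity:", math.exp(sum(losses) / len(losses)))
```

Because the masking is random, averaging over several passes (or fixing a random seed) gives a more stable estimate.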