CamemBERT 2.0: A Smarter French Language Model Aged to Perfection
Wissam Antoun, Francis Kulumba, Rian Touchent, Éric de la Clergerie, Benoît Sagot, Djamé Seddah
2024-11-14

Summary
This paper presents CamemBERT 2.0, a pair of updated French language models that improve French natural language processing by replacing outdated training data with a larger, more recent corpus.
What's the problem?
Many French language models, including the original CamemBERT, struggle because they were trained on data that is now several years old. This exposes them to 'temporal concept drift': language keeps evolving, so the models fail to understand new words and topics that have emerged since their training data was collected.
What's the solution?
The authors introduce two new versions of CamemBERT: CamemBERTav2 and CamemBERTv2. These models are built on more advanced architectures (DeBERTaV3 and RoBERTa, respectively) and trained on a much larger and more recent dataset. They also feature an improved tokenizer that better handles modern French, including recent vocabulary and emojis. Tested on a variety of tasks, the new models significantly outperform their predecessors.
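To make the tokenizer claim concrete, here is a minimal sketch of how one might inspect the updated tokenizer with the Hugging Face transformers library. The repository id "almanach/camembertav2-base" is our assumption about where the checkpoint is published, not a detail given in this summary.

```python
# Minimal sketch: inspecting the updated tokenizer via Hugging Face transformers.
# The repo id "almanach/camembertav2-base" is an assumed location for the
# published checkpoint and may differ from the authors' actual release.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("almanach/camembertav2-base")

# Modern French with recent vocabulary and an emoji; an updated vocabulary
# should segment this into a few meaningful pieces rather than falling back
# to many unknown or byte-level tokens.
tokens = tokenizer.tokenize("Le télétravail post-covid, c'est génial 😍")
print(tokens)
```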
Why it matters?
This research is important because it provides updated tools for natural language processing in French, which can enhance applications in fields like customer service, healthcare, and education. By improving how machines understand the French language, we can create more effective AI systems that better serve users.
Abstract
French language models, such as CamemBERT, have been widely adopted across industries for natural language processing (NLP) tasks, with CamemBERT alone seeing over 4 million downloads per month. However, these models face challenges due to temporal concept drift, where outdated training data leads to a decline in performance, especially when encountering new topics and terminology. This issue emphasizes the need for updated models that reflect current linguistic trends. In this paper, we introduce two new versions of the CamemBERT base model, CamemBERTav2 and CamemBERTv2, designed to address these challenges. CamemBERTav2 is based on the DeBERTaV3 architecture and makes use of the Replaced Token Detection (RTD) objective for better contextual understanding, while CamemBERTv2 is built on RoBERTa, which uses the Masked Language Modeling (MLM) objective. Both models are trained on a significantly larger and more recent dataset with a longer context length and an updated tokenizer that enhances tokenization performance for French. We evaluate the performance of these models on both general-domain NLP tasks and domain-specific applications, such as tasks in the medical field, demonstrating their versatility and effectiveness across a range of use cases. Our results show that these updated models vastly outperform their predecessors, making them valuable tools for modern NLP systems. All our new models, as well as intermediate checkpoints, are made openly available on Huggingface.
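Because CamemBERTv2 keeps the MLM objective, it can be used directly for fill-mask inference, whereas CamemBERTav2's RTD objective pretrains a discriminator that is typically fine-tuned rather than queried this way. The sketch below assumes the checkpoint is published as "almanach/camembertv2-base" on Huggingface; that repo id is our assumption, not something stated in the abstract.

```python
# Minimal usage sketch, assuming the repo id below; CamemBERTv2 is trained
# with MLM, so it supports fill-mask inference out of the box.
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="almanach/camembertv2-base")

# Use the tokenizer's own mask token so the sketch works regardless of
# whether the updated tokenizer uses "<mask>" or "[MASK]".
masked = f"Paris est la {fill_mask.tokenizer.mask_token} de la France."
for pred in fill_mask(masked):
    print(pred["token_str"], round(pred["score"], 3))
```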