Should We Still Pretrain Encoders with Masked Language Modeling?
Hippolyte Gisserot-Boukhlef, Nicolas Boizard, Manuel Faysse, Duarte M. Alves, Emmanuel Malherbe, André F. T. Martins, Céline Hudelot, Pierre Colombo
2025-07-08
Summary
This paper introduces a pretraining strategy for encoder language models that combines two standard objectives: Causal Language Modeling (CLM) and Masked Language Modeling (MLM). Combining them helps the model learn text representations that transfer better across a variety of tasks.
What's the problem?
Traditional pretraining uses either masked language modeling, where some words are hidden and the model must guess them from both left and right context, or causal language modeling, where the model predicts the next word from previous words only. Each objective has strengths and weaknesses, and it has been unclear which is best for learning text representations.
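The contrast between the two objectives can be sketched in plain Python. This is an illustrative token-level example only, not the paper's implementation; the mask token, mask rate, and sentence are hypothetical:

```python
import random

MASK = "[MASK]"

def clm_examples(tokens):
    """Causal LM: predict each token from its left context only."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]

def mlm_example(tokens, mask_rate=0.15, seed=0):
    """Masked LM: hide some tokens; predict them using the full
    (bidirectional) context that remains visible."""
    rng = random.Random(seed)
    inputs, targets = list(tokens), {}
    for i in range(len(tokens)):
        if rng.random() < mask_rate:
            targets[i] = tokens[i]   # what the model must recover
            inputs[i] = MASK         # what the model actually sees
    return inputs, targets

sent = ["the", "cat", "sat", "on", "the", "mat"]
# CLM sees only the prefix when predicting the next word
print(clm_examples(sent)[1])
# MLM sees the whole (partly masked) sentence at once
print(mlm_example(sent, mask_rate=0.3))
```

The key difference: a CLM example pairs a prefix with the next token, while an MLM example keeps the sentence intact except for the masked positions, so the model can attend to words on both sides of each gap.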
What's the solution?
The researchers designed a two-phase training approach: they initialize the model from a pretrained causal language model, then continue training with a mix of causal and masked language modeling objectives. This biphasic strategy produces better text representations than using either objective alone.
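A minimal sketch of such a two-phase schedule, in plain Python. The step counts, the `mlm_ratio` parameter, and the deterministic interleaving rule are hypothetical illustrations, not the paper's exact recipe:

```python
def biphasic_objective(step, phase1_steps, mlm_ratio=0.5):
    """Return which objective to train on at a given step.

    Phase 1: pure causal LM (in the paper's setup this phase
    corresponds to starting from a pretrained CLM checkpoint).
    Phase 2: interleave MLM batches at a fixed ratio.
    """
    if step < phase1_steps:
        return "clm"
    # Phase 2: deterministically mix in MLM batches at mlm_ratio
    t = step - phase1_steps
    return "mlm" if (t * mlm_ratio) % 1.0 < mlm_ratio else "clm"

# First 3 steps are CLM-only, then CLM and MLM alternate 50/50
schedule = [biphasic_objective(s, phase1_steps=3) for s in range(8)]
print(schedule)
```

The design point this illustrates: rather than choosing one objective up front, the schedule spends the early budget on next-token prediction and then blends in bidirectional masking, so the final model benefits from both training signals.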
Why does it matter?
This matters because it helps build language models that understand text more deeply and work better across different language tasks, making AI systems like translators, chatbots, and search engines more accurate and effective.
Abstract
A biphasic training strategy combining Causal Language Modeling and Masked Language Modeling yields optimal text representation performance, especially when initialized with pretrained CLM models.