GPT or BERT: why not both?

Lucas Georges Gabriel Charpentier, David Samuel

2024-11-04

Summary

This paper introduces GPT-BERT, a model that combines two popular types of language models: GPT (Generative Pre-trained Transformer) and BERT (Bidirectional Encoder Representations from Transformers). The goal is to take the best features of both approaches so that a single model can both generate and understand text.

What's the problem?

Most language models do one of two things: they either predict the next word in a sentence from the words that came before it (causal models like GPT) or fill in missing words by looking at the whole sentence at once (masked models like BERT). Because each model is trained with only one of these objectives, it misses the advantages of the other, which limits how well it performs across different tasks.
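To make the distinction concrete, here is a tiny illustrative sketch of the two objectives in PyTorch. The toy sequence, token IDs, and mask ID are invented for illustration and are not taken from the paper.

```python
import torch

# Toy setup (hypothetical values, for illustration only).
mask_id = 0
tokens = torch.tensor([[17, 42, 8, 93, 55]])  # one short token sequence

# Causal LM (GPT-style): predict each token from the tokens before it.
clm_inputs  = tokens[:, :-1]   # [17, 42,  8, 93]
clm_targets = tokens[:,  1:]   # [42,  8, 93, 55]

# Masked LM (BERT-style): hide some tokens, predict them from the full context.
masked_pos  = torch.tensor([[False, True, False, True, False]])
mlm_inputs  = tokens.masked_fill(masked_pos, mask_id)   # [17, MASK, 8, MASK, 55]
mlm_targets = tokens.masked_fill(~masked_pos, -100)     # compute loss only on masked positions
```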

What's the solution?

The authors created GPT-BERT by merging masked language modeling (the objective used in BERT) with causal language modeling (the objective used in GPT). Their training method lets a single model learn from both objectives at once, without needing separate architectures. The hybrid model was evaluated in the BabyLM Challenge 2024, where it outperformed models trained with only one of the two objectives.
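As a rough sketch of how such hybrid pretraining could be wired up (one plausible way to mix the two objectives, not the authors' exact recipe), the snippet below trains a single transformer and only switches the attention mask and the prediction targets depending on which objective a batch uses. `TinyTransformer`, its hyperparameters, and the masking ratio are hypothetical stand-ins; positional encodings are omitted for brevity.

```python
import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """A single transformer stack used for both objectives (positional encodings omitted)."""
    def __init__(self, vocab_size=1000, d_model=128, n_layers=2, n_heads=4):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, ids, causal: bool):
        n = ids.size(1)
        # Causal batches attend only to the past; masked batches see the full sequence.
        attn_mask = nn.Transformer.generate_square_subsequent_mask(n) if causal else None
        h = self.encoder(self.embed(ids), mask=attn_mask)
        return self.head(h)

model = TinyTransformer()
loss_fn = nn.CrossEntropyLoss(ignore_index=-100)
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)

def training_step(ids, use_causal: bool, mask_id=3, mask_prob=0.15):
    if use_causal:  # GPT-style: next-token prediction
        inputs, targets = ids[:, :-1], ids[:, 1:]
    else:           # BERT-style: masked-token prediction
        masked = torch.rand(ids.shape) < mask_prob
        inputs = ids.masked_fill(masked, mask_id)
        targets = ids.masked_fill(~masked, -100)  # loss only on masked positions
    logits = model(inputs, causal=use_causal)
    loss = loss_fn(logits.reshape(-1, logits.size(-1)), targets.reshape(-1))
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()

# Alternate between the two objectives during pretraining.
batch = torch.randint(4, 1000, (8, 32))
for step in range(4):
    training_step(batch, use_causal=(step % 2 == 0))
```

This sketch simply alternates between the two objectives; how the two are actually balanced during pretraining is a tunable design choice.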

Why it matters?

This research is significant because it enhances how language models can generate and understand text. By combining the strengths of GPT and BERT, GPT-BERT can perform better in tasks like text generation and comprehension, making it a valuable tool for applications in natural language processing. Additionally, by openly sharing their models and code, the authors encourage further research and development in this area.

Abstract

We present a simple way to merge masked language modeling with causal language modeling. This hybrid training objective results in a model that combines the strengths of both modeling paradigms within a single transformer stack: GPT-BERT can be transparently used like any standard causal or masked language model. We test the pretraining process that enables this flexible behavior on the BabyLM Challenge 2024. The results show that the hybrid pretraining outperforms masked-only or causal-only models. We openly release the models, training corpora and code.