
Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Benjamin Warner, Antoine Chaffin, Benjamin Clavié, Orion Weller, Oskar Hallström, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, Nathan Cooper, Griffin Adams, Jeremy Howard, Iacopo Poli

2024-12-19


Summary

This paper introduces ModernBERT, a modernized version of the BERT encoder that improves performance on tasks like text classification and retrieval while being faster and more memory-efficient.

What's the problem?

Although the original BERT model is still widely used for many language tasks, its architecture has seen few meaningful improvements since its release in 2018. It can also only handle sequences of up to 512 tokens, which limits its usefulness for longer documents or tasks that require understanding more context.
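As a quick illustration (not taken from the paper), the 512-token cap is visible directly in the original BERT checkpoint's tokenizer via the Hugging Face transformers library; the example document string below is made up:

```python
from transformers import AutoTokenizer

# The original BERT checkpoint caps inputs at 512 tokens, so longer
# documents must be truncated or split before they can be encoded.
tok = AutoTokenizer.from_pretrained("bert-base-uncased")
print(tok.model_max_length)  # should print 512 for the original BERT

long_doc = "some very long document " * 1000  # made-up oversized input
ids = tok(long_doc, truncation=True, max_length=tok.model_max_length)["input_ids"]
print(len(ids))  # capped at 512 -- everything beyond that is simply dropped
```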

What's the solution?

ModernBERT addresses these issues by combining several modern architectural techniques. It natively handles sequences of up to 8,192 tokens and is trained on 2 trillion tokens of data. Key components include rotary positional embeddings (RoPE), which encode relative positions and support longer contexts; Flash Attention, which speeds up attention and reduces memory use; and unpadding, which avoids spending compute on padding tokens in batched inputs. Together, these changes let ModernBERT outperform earlier encoders on a wide range of tasks while being faster and more memory-efficient.
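To make the rotary-embedding idea concrete, here is a minimal, self-contained PyTorch sketch of RoPE applied to one attention head's query vectors. It illustrates the general technique rather than ModernBERT's exact implementation; the function name `rotary_embed` and the tensor shapes are chosen purely for this example.

```python
import torch

def rotary_embed(x, base=10000.0):
    """Apply rotary positional embeddings to a tensor of shape
    (seq_len, head_dim). Pairs of dimensions are rotated by an angle
    that grows with the token's position, so relative offsets between
    tokens show up directly in the dot products used by attention."""
    seq_len, dim = x.shape
    half = dim // 2
    # Per-pair rotation frequencies, as in the standard RoPE formulation.
    freqs = base ** (-torch.arange(0, half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    # Rotate each (x1, x2) pair by its position-dependent angle.
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Example: rotate the query vectors of an 8-token sequence with head_dim=64.
q = torch.randn(8, 64)
q_rot = rotary_embed(q)
print(q_rot.shape)  # torch.Size([8, 64])
```

Because the rotation angle depends only on a token's position, the dot product between two rotated vectors depends on their relative offset, which is part of what lets rotary embeddings extend more gracefully to longer sequences than learned absolute positions.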

Why it matters?

This research is important because it pushes the boundaries of what language models can do, making them more efficient and capable of handling complex tasks. By improving how these models work, ModernBERT can benefit many applications in natural language processing, such as search engines, chatbots, and any software that relies on understanding large amounts of text.

Abstract

Encoder-only transformer models such as BERT offer a great performance-size tradeoff for retrieval and classification tasks with respect to larger decoder-only models. Despite being the workhorse of numerous production pipelines, there have been limited Pareto improvements to BERT since its release. In this paper, we introduce ModernBERT, bringing modern model optimizations to encoder-only models and representing a major Pareto improvement over older encoders. Trained on 2 trillion tokens with a native 8192 sequence length, ModernBERT models exhibit state-of-the-art results on a large pool of evaluations encompassing diverse classification tasks and both single and multi-vector retrieval on different domains (including code). In addition to strong downstream performance, ModernBERT is also the most speed and memory efficient encoder and is designed for inference on common GPUs.