OLMoE: Open Mixture-of-Experts Language Models
Niklas Muennighoff, Luca Soldaini, Dirk Groeneveld, Kyle Lo, Jacob Morrison, Sewon Min, Weijia Shi, Pete Walsh, Oyvind Tafjord, Nathan Lambert, Yuling Gu, Shane Arora, Akshita Bhagia, Dustin Schwenk, David Wadden, Alexander Wettig, Binyuan Hui, Tim Dettmers, Douwe Kiela, Ali Farhadi, Noah A. Smith, Pang Wei Koh
2024-09-04

Summary
This paper introduces OLMoE, a fully open language model that uses a sparse Mixture-of-Experts (MoE) architecture to achieve strong performance while keeping the compute used per input token low.
What's the problem?
Strong language models typically demand large amounts of compute and data: dense models activate all of their parameters for every input token, so scaling them up makes both training and inference expensive. This high cost limits who can train, study, and deploy such models, especially in settings with constrained resources.
What's the solution?
OLMoE is a Mixture-of-Experts model with 7 billion total parameters, of which only about 1 billion are activated for each input token: a learned router sends every token to a small subset of expert networks instead of running it through the full model, as illustrated in the sketch below. The model was pretrained on 5 trillion tokens and outperforms all available models with a similar number of active parameters, and even some larger ones. In addition, OLMoE is fully open: the model weights, training data, code, and logs are all released for anyone to inspect and build on.
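To make the efficiency idea concrete, here is a minimal sketch of a sparse MoE layer with top-k routing, written in PyTorch. The layer sizes, expert count, and top-k value are illustrative placeholders rather than OLMoE's actual configuration; the point is only that each token passes through a small subset of the experts, so the parameters used per token are a fraction of the layer's total.

```python
# Minimal sketch of a sparse Mixture-of-Experts (MoE) layer with top-k routing.
# Dimensions, number of experts, and top_k are illustrative, not OLMoE's config.
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoELayer(nn.Module):
    def __init__(self, d_model=512, d_hidden=1024, n_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        # Router: scores each token against every expert.
        self.router = nn.Linear(d_model, n_experts, bias=False)
        # Each expert is a small feed-forward network.
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_hidden), nn.GELU(), nn.Linear(d_hidden, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):  # x: (n_tokens, d_model)
        logits = self.router(x)                              # (n_tokens, n_experts)
        weights = F.softmax(logits, dim=-1)
        topk_w, topk_idx = weights.topk(self.top_k, dim=-1)  # keep only top_k experts per token
        topk_w = topk_w / topk_w.sum(dim=-1, keepdim=True)   # renormalize over chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = topk_idx[:, slot] == e                # tokens whose slot-th choice is expert e
                if mask.any():
                    w = topk_w[mask, slot].unsqueeze(-1)     # (n_selected, 1)
                    out[mask] += w * expert(x[mask])
        return out

# Each token only runs through top_k of the n_experts expert networks, so the
# compute (and parameters) used per token is a small fraction of the layer total.
layer = SparseMoELayer()
tokens = torch.randn(4, 512)
print(layer(tokens).shape)  # torch.Size([4, 512])
```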
Why it matters?
This research is important because it makes advanced language modeling more accessible and efficient. The Mixture-of-Experts design means OLMoE uses far less compute per token than a dense model of comparable quality, which lowers the cost of training and inference and makes deployment in resource-constrained settings more practical. Just as importantly, the fully open release of weights, data, code, and logs lets researchers and developers study, reproduce, and build on the model, making this technology more available to everyone.
Abstract
We introduce OLMoE, a fully open, state-of-the-art language model leveraging sparse Mixture-of-Experts (MoE). OLMoE-1B-7B has 7 billion (B) parameters but uses only 1B per input token. We pretrain it on 5 trillion tokens and further adapt it to create OLMoE-1B-7B-Instruct. Our models outperform all available models with similar active parameters, even surpassing larger ones like Llama2-13B-Chat and DeepSeekMoE-16B. We present various experiments on MoE training, analyze routing in our model showing high specialization, and open-source all aspects of our work: model weights, training data, code, and logs.
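Because all components are openly released, the model can be loaded like any other open-weights checkpoint. The sketch below uses the Hugging Face transformers library; the model identifier is an assumption based on the paper's naming and should be checked against the official release page.

```python
# Sketch of loading the released weights with Hugging Face transformers.
# The model ID below is an assumption based on the paper's naming (OLMoE-1B-7B);
# verify it against the official release before use.
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "allenai/OLMoE-1B-7B-0924"  # assumed ID; check the official release
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

inputs = tokenizer("Mixture-of-Experts language models are", return_tensors="pt")
outputs = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```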