MoMa: Efficient Early-Fusion Pre-training with Mixture of Modality-Aware Experts
Xi Victoria Lin, Akshat Shrivastava, Liang Luo, Srinivasan Iyer, Mike Lewis, Gargi Ghosh, Luke Zettlemoyer, Armen Aghajanyan
2024-08-01
Summary
This paper presents MoMa, an architecture designed to make pre-training of early-fusion language models that understand both text and images more efficient. It uses a mixture of modality-aware experts, so each type of data is handled by parameters specialized for it.
What's the problem?
Current models that work with multiple types of data, such as text and images, typically process every token with the same shared parameters. Because text and image tokens have very different statistics, this one-size-fits-all approach uses compute inefficiently and can hurt quality, especially on sequences that interleave information from different modalities.
What's the solution?
To solve this problem, the authors developed MoMa, which divides the experts in each mixture-of-experts layer into modality-specific groups: one group processes only text tokens, another only image tokens. Within each group, a learned router still decides which expert handles each token, so the model keeps fine-grained, semantically informed adaptivity while allocating parameters per modality. As a result, MoMa achieves substantial FLOPs savings during pre-training while matching the quality of a compute-equivalent dense model.
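The routing described above can be sketched in a few lines: hard-split tokens by modality, then apply learned routing within each group. This is a minimal NumPy illustration, not the paper's implementation; the toy dimensions, random weights, sigmoid gating, and capacity formula are assumptions, and the within-group routing shown is a simplified expert-choice scheme (each expert picks its top tokens).

```python
import numpy as np

rng = np.random.default_rng(0)

D = 8                  # toy hidden dimension (assumption)
N_TEXT, N_IMG = 4, 4   # experts per modality group, as in MoMa 1.4B

# Toy stand-ins for learned weights: one linear "expert" per slot,
# plus one learned router per modality group.
text_experts = [rng.normal(size=(D, D)) for _ in range(N_TEXT)]
image_experts = [rng.normal(size=(D, D)) for _ in range(N_IMG)]
text_router = rng.normal(size=(D, N_TEXT))
image_router = rng.normal(size=(D, N_IMG))

def expert_choice(x, experts, router):
    """Simplified expert-choice routing within one modality group:
    each expert selects its top-`capacity` tokens by router score."""
    capacity = max(1, x.shape[0] // len(experts))  # illustrative capacity
    scores = x @ router                            # (n_tokens, n_experts)
    out = np.zeros_like(x)
    for e, W in enumerate(experts):
        top = np.argsort(scores[:, e])[-capacity:]    # tokens expert e picks
        gate = 1.0 / (1.0 + np.exp(-scores[top, e]))  # sigmoid gating weight
        out[top] += gate[:, None] * (x[top] @ W)
    return out

def moma_layer(tokens, modalities):
    """Hard-split tokens by modality, then route within each group."""
    out = np.zeros_like(tokens)
    for group, experts, router in (("text", text_experts, text_router),
                                   ("image", image_experts, image_router)):
        idx = [i for i, m in enumerate(modalities) if m == group]
        if idx:
            out[idx] = expert_choice(tokens[idx], experts, router)
    return out

tokens = rng.normal(size=(6, D))
modalities = ["text", "image", "text", "text", "image", "image"]
y = moma_layer(tokens, modalities)
print(y.shape)  # (6, 8)
```

The key property the sketch demonstrates is isolation: a text token's output depends only on the text experts and the other text tokens in its group, which is what lets each group's parameters specialize to its modality.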
Why it matters?
This research is important because it enhances the ability of AI systems to process and understand mixed types of information more efficiently. By improving how models handle both text and images, MoMa can lead to faster and more effective AI applications in areas like image recognition, natural language processing, and multimedia analysis. This could ultimately make AI technologies more accessible and powerful for various real-world applications.
Abstract
We introduce MoMa, a novel modality-aware mixture-of-experts (MoE) architecture designed for pre-training mixed-modal, early-fusion language models. MoMa processes images and text in arbitrary sequences by dividing expert modules into modality-specific groups. These groups exclusively process designated tokens while employing learned routing within each group to maintain semantically informed adaptivity. Our empirical results reveal substantial pre-training efficiency gains through this modality-specific parameter allocation. Under a 1-trillion-token training budget, the MoMa 1.4B model, featuring 4 text experts and 4 image experts, achieves impressive FLOPs savings: 3.7x overall, with 2.6x for text and 5.2x for image processing compared to a compute-equivalent dense baseline, measured by pre-training loss. This outperforms the standard expert-choice MoE with 8 mixed-modal experts, which achieves 3x overall FLOPs savings (3x for text, 2.8x for image). Combining MoMa with mixture-of-depths (MoD) further improves pre-training FLOPs savings to 4.2x overall (text: 3.4x, image: 5.3x), although this combination hurts performance in causal inference due to increased sensitivity to router accuracy. These results demonstrate MoMa's potential to significantly advance the efficiency of mixed-modal, early-fusion language model pre-training, paving the way for more resource-efficient and capable multimodal AI systems.