
Drop-Upcycling: Training Sparse Mixture of Experts with Partial Re-initialization

Taishi Nakamura, Takuya Akiba, Kazuki Fujii, Yusuke Oda, Rio Yokota, Jun Suzuki

2025-02-27


Summary

This paper introduces Drop-Upcycling, a new way to train large AI models that makes them more efficient and powerful by combining pre-existing knowledge with fresh learning.

What's the problem?

Current methods for training large AI models, especially those using a technique called Mixture of Experts (MoE), face a dilemma. Starting with pre-trained knowledge (upcycling) gives a quick boost but slows down learning in the long run. On the other hand, starting from scratch takes longer initially but can lead to better performance over time.

What's the solution?

The researchers created Drop-Upcycling, which combines the benefits of both approaches. It starts from pre-trained knowledge but deliberately 'forgets' some of it by re-initializing selected parts of the weights at random. This mix of old and new knowledge helps the experts specialize and learn more efficiently. Extensive testing showed that it outperforms other ways of building MoE models, especially when training on massive amounts of data.
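To make the idea concrete, here is a minimal sketch (not the authors' code) of a Drop-Upcycling-style initialization: each MoE expert is copied from a pre-trained dense feed-forward weight matrix, and then a random fraction of its intermediate-dimension columns is re-initialized using the statistics of the original weights. The function name, the `drop_ratio` parameter, and the matrix sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def drop_upcycle_expert(dense_weight, drop_ratio=0.5):
    """Copy a dense FFN weight matrix, then re-initialize a random
    subset of its columns (the intermediate dimension) from a normal
    distribution matching the original weights' mean and std."""
    w = dense_weight.copy()
    d_in, d_ff = w.shape
    n_drop = int(d_ff * drop_ratio)
    cols = rng.choice(d_ff, size=n_drop, replace=False)
    # Partial re-initialization: fresh random weights for the chosen
    # columns, keeping the pre-trained values everywhere else.
    w[:, cols] = rng.normal(dense_weight.mean(), dense_weight.std(),
                            size=(d_in, n_drop))
    return w

# Build 8 experts from one dense FFN: each shares most of the
# pre-trained knowledge but differs in its re-initialized columns,
# which encourages the experts to specialize during MoE training.
dense_ffn = rng.normal(0.0, 0.02, size=(512, 2048))
experts = [drop_upcycle_expert(dense_ffn, drop_ratio=0.5) for _ in range(8)]
```

Because each expert keeps a different random subset of the original weights, the experts start out similar enough to benefit from the dense model's knowledge but different enough to diversify, which is the balance the method aims for.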

Why it matters?

This matters because it could make powerful AI systems much more efficient to train and run. The researchers showed their method could create an AI model that performs as well as a much larger traditional model while using only about a quarter of the computing power. This could lead to more advanced AI that's cheaper and faster to develop, potentially speeding up progress in various fields that use AI. By making their work openly available, they're also encouraging other researchers to build on and improve this technique.

Abstract

The Mixture of Experts (MoE) architecture reduces the training and inference cost significantly compared to a dense model of equivalent capacity. Upcycling is an approach that initializes and trains an MoE model using a pre-trained dense model. While upcycling leads to initial performance gains, the training progresses slower than when trained from scratch, leading to suboptimal performance in the long term. We propose Drop-Upcycling - a method that effectively addresses this problem. Drop-Upcycling combines two seemingly contradictory approaches: utilizing the knowledge of pre-trained dense models while statistically re-initializing some parts of the weights. This approach strategically promotes expert specialization, significantly enhancing the MoE model's efficiency in knowledge acquisition. Extensive large-scale experiments demonstrate that Drop-Upcycling significantly outperforms previous MoE construction methods in the long term, specifically when training on hundreds of billions of tokens or more. As a result, our MoE model with 5.9B active parameters achieves comparable performance to a 13B dense model in the same model family, while requiring approximately 1/4 of the training FLOPs. All experimental resources, including source code, training data, model checkpoints and logs, are publicly available to promote reproducibility and future research on MoE.