Read-ME: Refactorizing LLMs as Router-Decoupled Mixture of Experts with System Co-Design

Ruisi Cai, Yeonju Ro, Geon-Woo Kim, Peihao Wang, Babak Ehteshami Bejnordi, Aditya Akella, Zhangyang Wang

2024-10-28

Summary

This paper introduces Read-ME, a method that refactors pre-trained large language models (LLMs) into smaller Mixture-of-Experts (MoE) models, so they use memory and compute more efficiently during inference.

What's the problem?

Large language models are powerful but inefficient at inference time, consuming large amounts of memory and running slowly. MoE models can help by activating only specialized subnetworks (experts) for each input, but existing MoE designs suffer from poor memory management and inefficient batching, and training an MoE from scratch is expensive. These issues make MoEs hard to deploy effectively in real-world, resource-constrained settings.

What's the solution?

The authors propose Read-ME, which refactors an existing dense LLM into a smaller MoE model instead of training one from the ground up. They exploit activation sparsity in the dense model to identify and extract specialized experts. They also replace the usual layer-wise routers with a single pre-gating router that is decoupled from the MoE backbone: routing decisions can be computed ahead of time, which enables lookahead scheduling, expert-aware batching, and smarter caching, leading to faster and more memory-efficient inference.
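To make the routing idea concrete, here is a minimal, hypothetical PyTorch sketch of a pre-gating router decoupled from the MoE backbone. The module names, shapes, and top-1 argmax routing are illustrative assumptions, not the authors' implementation; the point is that expert assignments for every layer are computed from the input states before any expert weights are touched, so a serving system can prefetch, cache, and batch expert calls ahead of time.

```python
# Minimal sketch of a pre-gating router decoupled from the MoE backbone.
# Hypothetical shapes and module names; not the authors' implementation.
import torch
import torch.nn as nn


class PreGatingRouter(nn.Module):
    """Predicts, from the input states alone, which expert each token
    should use at every MoE layer -- before the backbone runs."""

    def __init__(self, d_model: int, n_layers: int, n_experts: int):
        super().__init__()
        # One lightweight gate per layer, all conditioned on the *input* states,
        # so routing decisions are available ahead of time.
        self.gates = nn.ModuleList(
            [nn.Linear(d_model, n_experts) for _ in range(n_layers)]
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq, d_model); returns expert ids of shape (n_layers, batch, seq)
        return torch.stack([gate(x).argmax(dim=-1) for gate in self.gates])


class MoELayer(nn.Module):
    """An MoE FFN layer that consumes precomputed expert assignments."""

    def __init__(self, d_model: int, n_experts: int):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, 4 * d_model), nn.GELU(),
                           nn.Linear(4 * d_model, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x: torch.Tensor, expert_ids: torch.Tensor) -> torch.Tensor:
        out = torch.zeros_like(x)
        for eid, expert in enumerate(self.experts):
            mask = expert_ids == eid          # tokens routed to this expert
            if mask.any():
                out[mask] = expert(x[mask])   # process them in one batch
        return out


# Usage: the full routing plan is known before any expert runs, so the serving
# system could prefetch and cache exactly the experts each batch will need.
d_model, n_layers, n_experts = 64, 4, 8
router = PreGatingRouter(d_model, n_layers, n_experts)
layers = nn.ModuleList([MoELayer(d_model, n_experts) for _ in range(n_layers)])

x = torch.randn(2, 16, d_model)               # (batch, seq, d_model)
plan = router(x)                              # (n_layers, batch, seq) expert ids
for layer, ids in zip(layers, plan):
    x = x + layer(x, ids)                     # residual MoE blocks
```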

Why it matters?

This research matters because it shows how to get the efficiency benefits of MoE models without the cost of training one from scratch. By co-designing the model architecture and the serving system, Read-ME outperforms open-source dense models of similar scale (by up to 10.1% on MMLU) while improving mean end-to-end latency by up to 6.1%, making advanced LLMs more practical in resource-constrained environments.

Abstract

The proliferation of large language models (LLMs) has led to the adoption of Mixture-of-Experts (MoE) architectures that dynamically leverage specialized subnetworks for improved efficiency and performance. Despite their benefits, MoE models face significant challenges during inference, including inefficient memory management and suboptimal batching, due to misaligned design choices between the model architecture and the system policies. Furthermore, the conventional approach of training MoEs from scratch is increasingly prohibitive in terms of cost. In this paper, we propose a novel framework Read-ME that transforms pre-trained dense LLMs into smaller MoE models (in contrast to "upcycling" generalist MoEs), avoiding the high costs of ground-up training. Our approach employs activation sparsity to extract experts. To compose experts, we examine the widely-adopted layer-wise router design and show its redundancy, and thus we introduce the pre-gating router decoupled from the MoE backbone that facilitates system-friendly pre-computing and lookahead scheduling, enhancing expert-aware batching and caching. Our codesign therefore addresses critical gaps on both the algorithmic and system fronts, establishing a scalable and efficient alternative for LLM inference in resource-constrained settings. Read-ME outperforms other popular open-source dense models of similar scales, achieving improvements of up to 10.1% on MMLU, and improving mean end-to-end latency up to 6.1%. Codes are available at: https://github.com/VITA-Group/READ-ME.
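As a rough illustration of the expert-extraction side, the toy sketch below carves a smaller "expert" FFN out of a dense FFN by keeping only the hidden units that activate most strongly on some calibration data. The keep_ratio, the mean-absolute-activation score, and all shapes are illustrative assumptions; the paper's actual activation-sparsity procedure may differ.

```python
# Toy sketch: extract a smaller "expert" from a dense FFN via activation
# sparsity, keeping only the hidden units that fire most on calibration data.
# Thresholds and shapes are assumptions, not the paper's extraction procedure.
import torch
import torch.nn as nn


def extract_expert(dense_ffn: nn.Sequential, calib_x: torch.Tensor,
                   keep_ratio: float = 0.25) -> nn.Sequential:
    up, act, down = dense_ffn[0], dense_ffn[1], dense_ffn[2]
    with torch.no_grad():
        # Score each hidden neuron by how strongly it activates on the data.
        h = act(up(calib_x))                      # (n_tokens, d_hidden)
        importance = h.abs().mean(dim=0)          # per-neuron activation score
        k = max(1, int(keep_ratio * importance.numel()))
        keep = importance.topk(k).indices         # most active neurons

        # Slice the dense weights down to the selected neurons.
        expert_up = nn.Linear(up.in_features, k)
        expert_up.weight.copy_(up.weight[keep])
        expert_up.bias.copy_(up.bias[keep])
        expert_down = nn.Linear(k, down.out_features)
        expert_down.weight.copy_(down.weight[:, keep])
        expert_down.bias.copy_(down.bias)
    return nn.Sequential(expert_up, act, expert_down)


# Usage: a dense 256 -> 1024 -> 256 FFN reduced to a 256 -> 256 -> 256 expert.
dense = nn.Sequential(nn.Linear(256, 1024), nn.GELU(), nn.Linear(1024, 256))
expert = extract_expert(dense, calib_x=torch.randn(512, 256), keep_ratio=0.25)
print(expert)
```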