
Let the Expert Stick to His Last: Expert-Specialized Fine-Tuning for Sparse Architectural Large Language Models

Zihan Wang, Deli Chen, Damai Dai, Runxin Xu, Zhuoshu Li, Y. Wu

2024-07-05


Summary

This paper presents Expert-Specialized Fine-Tuning (ESFT), a new method for customizing large language models (LLMs) built on the sparse Mixture-of-Experts (MoE) architecture for specific tasks.

What's the problem?

The main problem is that most existing fine-tuning methods for LLMs are designed for dense architectures, where every parameter is used for every input. They are a poor fit for sparse architectures like MoE, which route each input to a small set of 'experts' (specialized parts of the model). Because these methods ignore which experts a given task actually relies on, fine-tuning wastes resources and can deliver suboptimal performance.

What's the solution?

To address this issue, the authors propose ESFT, which fine-tunes only the experts most relevant to a specific task while keeping the other experts and shared modules frozen. This targeted approach lets the model adapt to new tasks more efficiently. The researchers found that it not only speeds up fine-tuning but also matches or even surpasses full-parameter fine-tuning, which updates every parameter in the model.
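As a rough illustration of the idea (not the authors' released code), the sketch below freezes every parameter of a hypothetical MoE model and then unfreezes only the experts whose routing scores cover a chosen fraction of a task's activations. The attributes `model.moe_layers` and `layer.experts`, and the cumulative-coverage selection rule, are assumptions made for this example.

```python
import torch

def freeze_all_but_relevant_experts(model, relevance_scores, coverage=0.2):
    """relevance_scores[layer_idx][expert_idx]: how strongly the router activated
    that expert on a sample of the task's data (any nonnegative score works)."""
    # Freeze every parameter in the model first.
    for param in model.parameters():
        param.requires_grad = False

    for layer_idx, layer in enumerate(model.moe_layers):  # hypothetical attribute
        scores = torch.tensor(relevance_scores[layer_idx], dtype=torch.float)
        probs = scores / scores.sum()

        # Keep the smallest set of experts whose cumulative score reaches `coverage`.
        order = torch.argsort(probs, descending=True)
        cumulative, selected = 0.0, []
        for idx in order.tolist():
            selected.append(idx)
            cumulative += probs[idx].item()
            if cumulative >= coverage:
                break

        # Unfreeze only the selected experts; the router and shared modules stay frozen.
        for idx in selected:
            for param in layer.experts[idx].parameters():
                param.requires_grad = True
```

After this step, a standard training loop would update only the unfrozen expert parameters, which is what makes the fine-tuning cheaper than updating the whole model.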

Why it matters?

This research is important because it makes it easier and more efficient to customize large language models for various applications. By improving how these models utilize their expert components, ESFT can lead to better performance in real-world tasks like language translation, content generation, and more, while also saving computational resources.

Abstract

Parameter-efficient fine-tuning (PEFT) is crucial for customizing Large Language Models (LLMs) with constrained resources. Although there have been various PEFT methods for dense-architecture LLMs, PEFT for sparse-architecture LLMs is still underexplored. In this work, we study the PEFT method for LLMs with the Mixture-of-Experts (MoE) architecture and the contents of this work are mainly threefold: (1) We investigate the dispersion degree of the activated experts in customized tasks, and find that the routing distribution for a specific task tends to be highly concentrated, while the distribution of activated experts varies significantly across different tasks. (2) We propose Expert-Specialized Fine-Tuning, or ESFT, which tunes the experts most relevant to downstream tasks while freezing the other experts and modules; experimental results demonstrate that our method not only improves the tuning efficiency, but also matches or even surpasses the performance of full-parameter fine-tuning. (3) We further analyze the impact of the MoE architecture on expert-specialized fine-tuning. We find that MoE models with finer-grained experts are more advantageous in selecting the combination of experts that are most relevant to downstream tasks, thereby enhancing both the training efficiency and effectiveness.
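The first finding above (routing for a task concentrates on a few experts) can be quantified with a simple statistic. Here is a minimal sketch, assuming one has already collected the top-k expert indices the router chose for each token of a task's sample data; `router_topk_indices` is a placeholder name for this example, not an API from the paper.

```python
import numpy as np

def expert_activation_distribution(router_topk_indices, num_experts):
    """router_topk_indices: int array of shape (num_tokens, k) holding the expert
    ids the router selected for each token of a task's sample data."""
    counts = np.bincount(router_topk_indices.ravel(), minlength=num_experts)
    return counts / counts.sum()

def normalized_entropy(dist):
    """Close to 0 when routing concentrates on a few experts, close to 1 when uniform."""
    num_experts = len(dist)
    nonzero = dist[dist > 0]
    return float(-(nonzero * np.log(nonzero)).sum() / np.log(num_experts))
```

Comparing this statistic across tasks would show low values (concentrated routing) within each task, while the set of dominant experts differs from task to task, which is the observation that motivates tuning only those experts.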