Router-Tuning: A Simple and Effective Approach for Enabling Dynamic-Depth in Transformers

Shwai He, Tao Ge, Guoheng Sun, Bowei Tian, Xiaoyang Wang, Ang Li, Dong Yu

2024-10-22

Summary

This paper introduces Router-Tuning, a method that makes transformer models more efficient by letting them adjust how many processing layers each input passes through based on its complexity.

What's the problem?

Traditional transformer models apply the same number of processing layers to every input, which wastes computation and slows inference, especially when simpler inputs don't need that many layers.

What's the solution?

Router-Tuning trains only a small 'router' that decides, for each input, which layers to run and which to skip, so simpler inputs bypass unnecessary computation without the cost of retraining the entire model. To keep accuracy from dropping when layers are skipped, the authors apply dynamic depth to the attention sub-layers through a technique called MindSkip.
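To make the idea concrete, here is a minimal sketch of a router-gated attention block in the spirit of MindSkip. The class, parameter names, and the mean-pooled gating heuristic are illustrative assumptions, not the paper's actual implementation (which is available in the linked repository); the point is that a tiny router scores each input and decides whether the attention sub-layer runs at all.

```python
import torch
import torch.nn as nn

class RouterGatedAttention(nn.Module):
    """Illustrative sketch: a lightweight router decides, per input,
    whether to run the attention sub-layer or skip it entirely."""

    def __init__(self, hidden_size: int, num_heads: int, threshold: float = 0.5):
        super().__init__()
        self.attn = nn.MultiheadAttention(hidden_size, num_heads, batch_first=True)
        # The router is a tiny linear gate -- the only part Router-Tuning would train.
        self.router = nn.Linear(hidden_size, 1)
        self.threshold = threshold

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Score each sequence (mean-pooled) to decide if attention is worth computing.
        gate = torch.sigmoid(self.router(x.mean(dim=1)))   # (batch, 1)
        if self.training:
            # Soft gating keeps the router differentiable while it is being tuned.
            attn_out, _ = self.attn(x, x, x)
            return x + gate.unsqueeze(1) * attn_out
        # At inference, hard-skip the layer for inputs the router deems easy.
        keep = gate.squeeze(-1) > self.threshold            # (batch,)
        if not keep.any():
            return x
        out = x.clone()
        attn_out, _ = self.attn(x[keep], x[keep], x[keep])
        out[keep] = x[keep] + attn_out
        return out
```

During training the gate scales the attention output, so gradients reach the router; at inference the same score becomes a hard skip decision, which is where the compute and memory savings come from.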

Why it matters?

This approach is significant because it can lead to faster processing times and lower computational costs without sacrificing accuracy, making transformer models more practical for real-world applications.

Abstract

Traditional transformer models often allocate a fixed amount of computational resources to every input token, leading to inefficient and unnecessary computation. To address this, the Mixture of Depths (MoD) was introduced to dynamically adjust the computational depth by skipping less important layers. Despite its promise, current MoD approaches remain under-explored and face two main challenges: (1) high training costs due to the need to train the entire model along with the routers that determine which layers to skip, and (2) the risk of performance degradation when important layers are bypassed. In response to the first issue, we propose Router-Tuning, a method that fine-tunes only the router on a small dataset, drastically reducing the computational overhead associated with full model training. For the second challenge, we propose MindSkip, which deploys Attention with Dynamic Depths. This method preserves the model's performance while significantly enhancing computational and memory efficiency. Extensive experiments demonstrate that our approach delivers competitive results while dramatically improving the computation efficiency, e.g., 21% speedup and only a 0.2% performance drop. The code is released at https://github.com/CASE-Lab-UMD/Router-Tuning.
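The abstract's key training recipe is that only the routers are fine-tuned while the backbone stays frozen. The sketch below illustrates that setup under the assumption that router modules register parameters whose names contain "router"; the naming convention and helper function are hypothetical, not taken from the released code.

```python
import torch

def router_tuning_params(model: torch.nn.Module):
    """Sketch of the Router-Tuning recipe: freeze the backbone and
    optimize only the router parameters on a small dataset.
    Assumes router parameters have 'router' in their names (illustrative)."""
    for name, param in model.named_parameters():
        param.requires_grad = "router" in name
    return [p for p in model.parameters() if p.requires_grad]

# Usage (hypothetical): only the small set of router weights receives gradients,
# so the tuning pass is far cheaper than full fine-tuning.
# optimizer = torch.optim.AdamW(router_tuning_params(model), lr=1e-4)
```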