DDK: Distilling Domain Knowledge for Efficient Large Language Models
Jiaheng Liu, Chenchen Zhang, Jinyang Guo, Yuanxing Zhang, Haoran Que, Ken Deng, Zhiqi Bai, Jie Liu, Ge Zhang, Jiakai Wang, Yanan Wu, Congnan Liu, Wenbo Su, Jiamang Wang, Lin Qu, Bo Zheng
2024-07-25

Summary
This paper introduces DDK, a new method for improving the efficiency of large language models (LLMs) by transferring knowledge from a larger, more powerful model (the teacher) to a smaller, more efficient one (the student). Its key idea is to optimize the mix of training data used in this process so that the student's performance improves across different domains.
What's the problem?
Large language models are very powerful, but they demand substantial computing resources and storage. When distilling them into smaller models that still perform well, existing methods often overlook how differently the teacher and student perform across domains. As a result, the smaller model learns least effectively in exactly the domains where it needs the most help, lowering its overall performance.
What's the solution?
DDK addresses this problem by dynamically adjusting the composition of the distillation data based on how the student performs relative to the teacher in each domain. It computes a 'domain discrepancy factor' that identifies which domains need more focus during training, then samples data from those domains more frequently. This targeted data mixture helps the student model improve precisely where it lags the most, significantly boosting its performance without extensive retraining; a minimal sketch of the idea follows below.
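
The sketch below illustrates the core idea in Python, assuming the per-domain discrepancy is estimated from held-out validation losses and converted into a sampling distribution with a softmax. The function names, the softmax-over-gaps formulation, and the temperature parameter are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn.functional as F

def domain_discrepancy_factors(student_losses, teacher_losses, temperature=1.0):
    """Turn per-domain validation losses into a sampling distribution.

    student_losses / teacher_losses: 1-D tensors with one average loss per
    domain. Domains where the student lags the teacher most get the highest
    sampling probability. (Illustrative formulation, not the paper's exact one.)
    """
    gaps = torch.clamp(student_losses - teacher_losses, min=0.0)
    # Softmax over the gaps yields a valid probability distribution; the
    # temperature controls how sharply large-gap domains are up-weighted.
    return F.softmax(gaps / temperature, dim=0)

def sample_domain(probs):
    """Pick the domain to draw the next distillation batch from."""
    return torch.multinomial(probs, num_samples=1).item()

# Hypothetical per-domain validation losses (e.g. web text, code, math).
student_val = torch.tensor([2.1, 2.9, 3.4])
teacher_val = torch.tensor([1.9, 2.0, 2.2])
probs = domain_discrepancy_factors(student_val, teacher_val)
next_domain = sample_domain(probs)  # the largest-gap domain is sampled most often
```

In this toy example the third domain, where the student trails the teacher by the widest margin, receives the largest share of distillation batches.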
Why it matters?
This research is important because it allows for the creation of smaller, more efficient language models that can operate effectively even on devices with limited resources. By improving how knowledge is transferred from larger models, DDK can help make advanced AI technology more accessible and usable in everyday applications, such as on smartphones or other devices.
Abstract
Despite the advanced intelligence abilities of large language models (LLMs) in various applications, they still face significant computational and storage demands. Knowledge Distillation (KD) has emerged as an effective strategy to improve the performance of a smaller LLM (i.e., the student model) by transferring knowledge from a high-performing LLM (i.e., the teacher model). Prevailing techniques in LLM distillation typically use a black-box model API to generate high-quality pretrained and aligned datasets, or utilize white-box distillation by altering the loss function to better transfer knowledge from the teacher LLM. However, these methods ignore the knowledge differences between the student and teacher LLMs across domains. This results in excessive focus on domains with minimal performance gaps and insufficient attention to domains with large gaps, reducing overall performance. In this paper, we introduce a new LLM distillation framework called DDK, which dynamically adjusts the composition of the distillation dataset in a smooth manner according to the domain performance differences between the teacher and student models, making the distillation process more stable and effective. Extensive evaluations show that DDK significantly improves the performance of student models, outperforming both continuously pretrained baselines and existing knowledge distillation methods by a large margin.
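
The abstract notes that DDK adjusts the distillation data mixture "in a smooth manner." A minimal sketch of what such smoothing could look like is given below, assuming the domain weights are periodically re-estimated and then blended with the previous mixture via a momentum term; the `smooth_update` and `kd_loss` functions, the momentum value, and the soft-label KL objective are assumptions for illustration rather than the paper's exact method.

```python
import torch
import torch.nn.functional as F

def smooth_update(old_probs, new_probs, momentum=0.9):
    """Blend freshly re-estimated domain weights with the previous ones so the
    data mixture drifts gradually instead of jumping after every evaluation."""
    mixed = momentum * old_probs + (1.0 - momentum) * new_probs
    return mixed / mixed.sum()

def kd_loss(student_logits, teacher_logits, tau=2.0):
    """Standard soft-label KL distillation loss over token distributions
    (a common white-box KD objective; DDK's exact loss may differ)."""
    s = F.log_softmax(student_logits / tau, dim=-1)
    t = F.softmax(teacher_logits / tau, dim=-1)
    return F.kl_div(s, t, reduction="batchmean") * tau * tau

# Example: re-estimate domain weights periodically, then update them smoothly.
old = torch.tensor([0.40, 0.35, 0.25])   # current sampling mixture
new = torch.tensor([0.20, 0.30, 0.50])   # mixture implied by the latest gaps
probs = smooth_update(old, new)          # shifts gradually toward lagging domains
```

The momentum-style blend is one way to keep the distillation data composition stable between evaluations while still steering it toward the domains with the largest teacher-student gaps.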