Mentor-KD: Making Small Language Models Better Multi-step Reasoners
Hojae Lee, Junho Kim, SangKeun Lee
2024-10-14

Summary
This paper presents Mentor-KD, a new method that helps smaller language models improve their ability to reason through complex problems by learning from larger, more capable models.
What's the problem?
While large language models (LLMs) are great at multi-step reasoning, smaller models often struggle with this skill. They can be faster and more efficient but lack the depth of reasoning needed for complex tasks. Previous methods to train smaller models have faced challenges because they didn't provide enough high-quality examples or detailed guidance on how to reason through problems.
What's the solution?
Mentor-KD addresses these challenges by introducing a 'mentor': an intermediate-sized model that has been fine-tuned on the target task so it can generate helpful reasoning examples. The mentor augments the training data with additional step-by-step rationales and also provides its output probability distributions (called soft labels) to the smaller 'student' model. By learning from both the mentor's rationales and its soft labels, the student model becomes better at solving complex problems without needing to be as large as the mentor or the original LLM teacher.
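The soft-label side of this setup follows the standard knowledge-distillation recipe: the student is trained to match the mentor's temperature-softened output distribution rather than only hard answer labels. Below is a minimal, self-contained sketch of that idea in plain Python; the function names and the temperature value are illustrative, not taken from the paper, and a real implementation would operate on token-level logits in a framework like PyTorch.

```python
import math

def softmax(logits, temperature=1.0):
    # Temperature-scaled softmax: a higher temperature yields a softer
    # (more uniform) distribution, exposing the mentor's "dark knowledge".
    scaled = [x / temperature for x in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(student_logits, mentor_logits, temperature=2.0):
    # KL divergence from the mentor's softened distribution (the soft labels)
    # to the student's, scaled by T^2 as in standard knowledge distillation.
    p = softmax(mentor_logits, temperature)   # mentor's soft labels
    q = softmax(student_logits, temperature)  # student's prediction
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)
    return temperature ** 2 * kl
```

In training, this distillation term would typically be combined with the usual cross-entropy loss on the mentor-generated rationales, so the student learns both what to predict and how confident to be.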
Why it matters?
This research is important because it allows smaller language models to perform better in reasoning tasks while maintaining their efficiency. This can make them more useful in real-world applications, such as chatbots or educational tools, where quick and accurate responses are essential.
Abstract
Large Language Models (LLMs) have displayed remarkable performance across various complex tasks by leveraging Chain-of-Thought (CoT) prompting. Recently, studies have proposed a Knowledge Distillation (KD) approach, reasoning distillation, which transfers the reasoning ability of LLMs by fine-tuning smaller language models on multi-step rationales generated by LLM teachers. However, these approaches inadequately consider two challenges regarding insufficient distillation sets from the LLM teacher model, in terms of 1) data quality and 2) soft label provision. In this paper, we propose Mentor-KD, which effectively distills the multi-step reasoning capability of LLMs to smaller LMs while addressing the aforementioned challenges. Specifically, we exploit a mentor, an intermediate-sized task-specific fine-tuned model, to augment additional CoT annotations and provide soft labels for the student model during reasoning distillation. We conduct extensive experiments and confirm Mentor-KD's effectiveness across various models and complex reasoning tasks.