TinyR1-32B-Preview: Boosting Accuracy with Branch-Merge Distillation
Lin Sun, Guangxiang Zhao, Xiaoqi Jian, Yuhan Wu, Weihong Lin, Yongfu Zhu, Change Jia, Linglin Zhang, Jinzhu Wu, Junfeng Ran, Sai-er Hu, Zihan Jiang, Junting Zhou, Wenrui Liu, Bin Cui, Tong Yang, Xiangzheng Zhang
2025-03-10
Summary
This paper introduces Branch-Merge distillation, a method for creating smaller, more efficient AI language models without sacrificing their ability to perform well across a range of tasks.
What's the problem?
Large Language Models (LLMs) are very powerful but also very large, which makes them hard to run on ordinary computers. When researchers compress them into smaller versions, the resulting models are often much worse at solving problems.
What's the solution?
The researchers created a two-step process called Branch-Merge distillation. First, knowledge from a large teacher model is distilled into smaller student models, each specialized for a different subject such as math, coding, or science. Then, these specialized models are merged into a single smaller model that handles all of these tasks well. They tested the method by creating TinyR1-32B-Preview, which outperformed similar-sized models and came close to the original large model on some tasks.
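The Merge Phase combines the specialized students into one model. The paper summary does not spell out the exact merging algorithm, so the following is a minimal sketch of one common approach, naive weighted parameter averaging, with parameters represented as plain lists of floats; the names `merge_models`, `math_model`, and `code_model` are illustrative assumptions, not the paper's actual code.

```python
# Hedged sketch of the Merge Phase as naive weighted parameter averaging.
# Real merges operate on model state dicts (tensors); flat float lists
# are used here so the example is self-contained and runnable.

def merge_models(models, weights=None):
    """Average each named parameter vector across the student models."""
    if weights is None:
        # Default to a uniform average over all students.
        weights = [1.0 / len(models)] * len(models)
    merged = {}
    for name in models[0]:
        merged[name] = [
            sum(w * m[name][i] for m, w in zip(models, weights))
            for i in range(len(models[0][name]))
        ]
    return merged

# Two toy domain-specialized "students", each with one parameter vector.
math_model = {"layer.weight": [1.0, 2.0]}
code_model = {"layer.weight": [3.0, 4.0]}

merged = merge_models([math_model, code_model])
print(merged["layer.weight"])  # prints [2.0, 3.0], the uniform average
```

Non-uniform weights would let the merge favor one domain; more sophisticated merge methods additionally resolve sign conflicts between parameters rather than averaging blindly.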
Why it matters?
This matters because it could make powerful AI language models accessible to people without high-end hardware. It could lead to better AI assistants on phones or laptops that help with a wide range of tasks, from solving math problems to writing code, without expensive equipment. The approach also reduces the time and cost of creating such models, which could speed up AI research and development.
Abstract
The challenge of reducing the size of Large Language Models (LLMs) while maintaining their performance has gained significant attention. However, existing methods, such as model distillation and transfer learning, often fail to achieve high accuracy. To address this limitation, we introduce the Branch-Merge distillation approach, which enhances model compression through two phases: (1) the Branch Phase, where knowledge from a large teacher model is selectively distilled into specialized student models via domain-specific supervised fine-tuning (SFT); and (2) the Merge Phase, where these student models are merged to enable cross-domain knowledge transfer and improve generalization. We validate our distillation approach using DeepSeek-R1 as the teacher and DeepSeek-R1-Distill-Qwen-32B as the student. The resulting merged model, TinyR1-32B-Preview, outperforms its counterpart DeepSeek-R1-Distill-Qwen-32B across multiple benchmarks, including Mathematics (+5.5 points), Coding (+4.4 points), and Science (+2.9 points), while achieving near-equal performance to DeepSeek-R1 on AIME 2024. The Branch-Merge distillation approach provides a scalable solution for creating smaller, high-performing LLMs with reduced computational cost and time.
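The Branch Phase described above routes teacher knowledge into per-domain SFT sets, one per specialized student. The abstract does not detail how training data is assigned to domains, so this is a toy sketch under stated assumptions: the `classify_domain` keyword heuristic and the `branch` helper are hypothetical stand-ins for a real domain classifier and data pipeline.

```python
# Hedged sketch of the Branch Phase: partition (prompt, teacher_response)
# pairs by subject so each specialized student is fine-tuned only on its
# own domain. The keyword heuristic below is an illustrative assumption.

def classify_domain(prompt):
    """Toy keyword heuristic standing in for a real domain classifier."""
    text = prompt.lower()
    if any(k in text for k in ("prove", "integral", "equation")):
        return "math"
    if any(k in text for k in ("function", "bug", "compile")):
        return "coding"
    return "science"

def branch(teacher_pairs):
    """Split teacher-generated pairs into per-domain SFT datasets."""
    buckets = {"math": [], "coding": [], "science": []}
    for prompt, response in teacher_pairs:
        buckets[classify_domain(prompt)].append((prompt, response))
    return buckets

pairs = [
    ("Solve the equation x^2 = 4", "x = 2 or x = -2"),
    ("Fix the bug in this function", "..."),
]
buckets = branch(pairs)
print(len(buckets["math"]), len(buckets["coding"]))  # prints: 1 1
```

Each bucket would then drive a separate SFT run of the student model, producing the domain specialists that the Merge Phase later combines.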