
Masking Teacher and Reinforcing Student for Distilling Vision-Language Models

Byung-Kwan Lee, Yu-Chiang Frank Wang, Ryo Hachiuma

2025-12-29


Summary

This paper focuses on shrinking powerful AI systems that understand both images and language so they can run on phones and other devices with limited computing power.

What's the problem?

Current AI models that handle images and language are huge, making them difficult to use on devices like smartphones. Trying to shrink these models while still maintaining their intelligence is tough because the smaller 'student' model struggles to learn from the complex information provided by the larger 'teacher' model, leading to unstable learning and worse results.

What's the solution?

The researchers developed a method called Masters, short for Masking Teacher and Reinforcing Student. It works in two main steps: first, they simplify the large 'teacher' model by temporarily hiding some of its less important weights; then, they gradually bring those weights back during training, so the smaller 'student' model can learn from the teacher in a smooth, stable way. They also add a reinforcement learning stage, but instead of having the AI reason step-by-step online, they use answers pre-generated by the simplified teacher and score them with two rewards: one for whether the answer is correct, and one for how easy the answer is for the student to pick up. This makes the process faster and more efficient.
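To make the "hide, then gradually restore" idea more concrete, here is a minimal sketch in Python. Everything in it is an illustrative assumption rather than the paper's exact procedure: the magnitude-based masking rule, the linear restoration schedule, and the function names `mask_ratio_at` and `apply_magnitude_mask` are all hypothetical stand-ins.

```python
# Illustrative sketch of mask-progressive teacher restoration.
# All names and the masking/schedule choices below are assumptions,
# not the paper's actual implementation.
import torch


def mask_ratio_at(step: int, total_steps: int, start_ratio: float = 0.5) -> float:
    """Fraction of teacher weights to hide; shrinks linearly to 0 so the
    teacher's full capacity is gradually restored during training."""
    return start_ratio * max(0.0, 1.0 - step / total_steps)


def apply_magnitude_mask(weight: torch.Tensor, ratio: float) -> torch.Tensor:
    """Zero out the `ratio` fraction of weights with the smallest magnitude
    (a simple stand-in for masking the 'non-dominant' weights)."""
    if ratio <= 0.0:
        return weight
    k = int(weight.numel() * ratio)
    if k == 0:
        return weight
    threshold = weight.abs().flatten().kthvalue(k).values
    return weight * (weight.abs() > threshold)


# Hypothetical usage: before each distillation step, rebuild the masked
# teacher from its original weights at the current mask ratio.
# for step in range(total_steps):
#     ratio = mask_ratio_at(step, total_steps)
#     for name, p in teacher.named_parameters():
#         p.data = apply_magnitude_mask(original_weights[name], ratio)
#     ...distill the student against the (partially masked) teacher...
```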

Why it matters?

This research is important because it allows us to create powerful AI systems that can understand images and language and run directly on our phones or other devices without needing a constant internet connection to a powerful computer. This opens up possibilities for more accessible and private AI applications.

Abstract

Large-scale vision-language models (VLMs) have recently achieved remarkable multimodal understanding, but their massive size makes them impractical for deployment on mobile or edge devices. This raises the need for compact yet capable VLMs that can efficiently learn from powerful large teachers. However, distilling knowledge from a large teacher to a small student remains challenging due to their large size gap: the student often fails to reproduce the teacher's complex, high-dimensional representations, leading to unstable learning and degraded performance. To address this, we propose Masters (Masking Teacher and Reinforcing Student), a mask-progressive reinforcement learning (RL) distillation framework. Masters first masks non-dominant weights of the teacher to reduce unnecessary complexity, then progressively restores the teacher by gradually increasing its capacity during training. This strategy allows the student to learn richer representations from the teacher in a smooth and stable manner. To further refine knowledge transfer, Masters integrates an offline RL stage with two complementary rewards: an accuracy reward that measures the correctness of the generated responses, and a distillation reward that quantifies the ease of transferring responses from teacher to student. Unlike online think-answer RL paradigms that are computationally expensive and generate lengthy responses, our offline RL leverages pre-generated responses from masked teachers. These provide rich yet efficient guidance, enabling students to achieve strong performance without requiring the think-answer process.
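As a rough illustration of the offline RL stage described above, the sketch below combines an accuracy reward with a distillation reward over pre-generated teacher responses. The reward definitions, the likelihood-based proxy for "ease of transfer", and the trade-off weight `alpha` are assumptions made for illustration, not the paper's formulation.

```python
# Illustrative sketch of combining the two offline-RL rewards over
# pre-generated responses. Reward definitions and weighting are assumptions.
import math
from dataclasses import dataclass


@dataclass
class OfflineSample:
    response: str            # response pre-generated by a masked teacher
    is_correct: bool         # whether it matches the ground-truth answer
    student_logprob: float   # student's average per-token log-prob of the response


def accuracy_reward(sample: OfflineSample) -> float:
    """Correctness of the pre-generated response."""
    return 1.0 if sample.is_correct else 0.0


def distillation_reward(sample: OfflineSample) -> float:
    """Ease of transfer: higher student likelihood of the response means it is
    easier to learn (an assumed proxy for the paper's distillation reward)."""
    return math.exp(sample.student_logprob)  # in (0, 1] for log-probs <= 0


def total_reward(sample: OfflineSample, alpha: float = 0.5) -> float:
    """Blend the two rewards; alpha is a hypothetical trade-off weight."""
    return alpha * accuracy_reward(sample) + (1 - alpha) * distillation_reward(sample)


# Usage: these rewards would weight the student's offline update on the
# pre-generated responses, instead of running expensive online rollouts.
samples = [
    OfflineSample("A cat sitting on a mat.", True, -0.3),
    OfflineSample("A dog.", False, -1.2),
]
print([round(total_reward(s), 3) for s in samples])
```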