
Simple Semi-supervised Knowledge Distillation from Vision-Language Models via Dual-Head Optimization

Seongjae Kang, Dong Bok Lee, Hyungjoon Jang, Sung Ju Hwang

2025-05-19


Summary

This paper introduces a new method called Dual-Head Optimization (DHO), which transfers knowledge from large, powerful AI models that understand both images and text to smaller, more focused models that can do specific tasks really well.

What's the problem?

Big vision-language models are smart, but they need a lot of computing power and memory, so they're hard to use in places where resources are limited. Making smaller models perform as well as the big ones, especially when labeled training data is scarce, is a challenge.

What's the solution?

The researchers created the DHO framework, which gives the smaller model two separate prediction heads: one learns from the limited labeled data, and the other learns to imitate the big model's predictions. Because the two learning signals don't interfere with each other, the smaller model learns much more effectively, even when there isn't a lot of labeled training data, and gets much better at its task without needing as many resources.
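The dual-head idea can be sketched in a few lines of PyTorch. This is a minimal illustration under assumptions, not the authors' code: the class and function names (`DualHeadStudent`, `dual_head_loss`), the tiny backbone, and the loss weighting are all hypothetical, but the structure matches the description above, with one head trained on labels and one trained to match the teacher.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DualHeadStudent(nn.Module):
    """Small student with a shared backbone and two prediction heads
    (a hypothetical sketch of the dual-head setup, not the paper's code)."""

    def __init__(self, in_dim=32, feature_dim=64, num_classes=10):
        super().__init__()
        self.backbone = nn.Sequential(nn.Linear(in_dim, feature_dim), nn.ReLU())
        self.label_head = nn.Linear(feature_dim, num_classes)    # trained on ground-truth labels
        self.distill_head = nn.Linear(feature_dim, num_classes)  # trained to mimic the teacher

    def forward(self, x):
        z = self.backbone(x)
        return self.label_head(z), self.distill_head(z)

def dual_head_loss(label_logits, distill_logits, labels, teacher_probs, T=2.0):
    """Each head gets its own loss, so the two signals don't conflict."""
    ce = F.cross_entropy(label_logits, labels)                       # supervised signal
    kd = F.kl_div(F.log_softmax(distill_logits / T, dim=-1),         # distillation signal
                  teacher_probs, reduction="batchmean") * (T * T)
    return ce + kd
```

In practice the supervised loss would only be computed on the labeled subset of the data, while the distillation loss can use every image, which is what makes the approach semi-supervised.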

Why it matters?

This matters because it means we can build fast, efficient AI systems that still perform at a high level, making advanced technology more practical for everyone, including on devices like phones and in places with limited computing power.

Abstract

The Dual-Head Optimization (DHO) framework improves knowledge distillation from vision-language models to compact task-specific models by using dual prediction heads, achieving state-of-the-art performance with fewer parameters.