GenRecal: Generation after Recalibration from Large to Small Vision-Language Models
Byung-Kwan Lee, Ryo Hachiuma, Yong Man Ro, Yu-Chiang Frank Wang, Yueh-Hua Wu
2025-06-19
Summary
This paper introduces GenRecal, a new way to improve vision-language models by training smaller models to mimic larger ones while ensuring that their internal representations of images and text match closely.
What's the problem?
Large vision-language models are often too big and slow for practical deployment, so smaller models are needed. However, making these smaller models perform as well is difficult, especially when their internal processing differs from that of the larger models they are meant to imitate.
What's the solution?
The researchers created a distillation framework that aligns the feature representations inside the two models: the smaller model learns to match the larger one not just in its outputs but also in how it processes information internally, which improves performance even when the teacher and student have different architectures.
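The paper's exact recalibration module is not detailed in this summary, but the general idea of feature-alignment distillation can be sketched simply: project the student's internal features into the teacher's feature space (here with a hypothetical learned linear projection) and penalize the mismatch. This is an illustrative sketch, not GenRecal's actual implementation.

```python
def project(features, weights):
    """Map student features into the teacher's feature space via a
    (hypothetical) learned linear projection: one output per weight row."""
    return [sum(w * f for w, f in zip(row, features)) for row in weights]

def alignment_loss(student_feats, teacher_feats, weights):
    """Mean-squared error between projected student features and teacher
    features -- one simple way to align internal representations."""
    projected = project(student_feats, weights)
    return sum((p - t) ** 2 for p, t in zip(projected, teacher_feats)) / len(teacher_feats)

# Toy example: 2-dim student features, 3-dim teacher features.
student = [1.0, 2.0]
teacher = [1.0, 1.0, 0.0]
W = [[1.0, 0.0], [0.0, 0.25], [0.5, -0.5]]  # hypothetical projection matrix
print(alignment_loss(student, teacher, W))
```

In practice this alignment term would be minimized jointly with the usual output-level distillation loss, so the student matches the teacher at both levels.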
Why it matters?
This matters because it helps build faster and more efficient vision-language AI systems without losing accuracy, making these technologies more accessible for use in real-world applications like image captioning and question answering.
Abstract
GenRecal, a novel distillation framework, improves the performance of small vision-language models by aligning feature representations across different architectures.