VLsI: Verbalized Layers-to-Interactions from Large to Small Vision Language Models

Byung-Kwan Lee, Ryo Hachiuma, Yu-Chiang Frank Wang, Yong Man Ro, Yueh-Hua Wu

2024-12-03

Summary

This paper introduces VLsI, a new approach that improves smaller vision-language models (VLMs) by efficiently transferring knowledge from larger models, without model scaling, merging, or architectural changes.

What's the problem?

As researchers develop larger and more powerful vision-language models, it becomes challenging to run these models on devices with limited resources, like smartphones and robots. Larger models require substantial computational power and memory, making them impractical for everyday use. Additionally, existing methods that improve smaller models by having them imitate a larger model's outputs can suffer from training instability and inefficiency.

What's the solution?

VLsI (Verbalized Layers-to-Interactions) addresses these challenges with a layer-wise distillation process. It introduces intermediate 'verbalizers' that map the features of each layer into natural language space, so a smaller model can flexibly align with the reasoning process of a larger one layer by layer, rather than only imitating its final outputs. The researchers tested VLsI on ten challenging vision-language benchmarks and found notable performance gains (11.0% for the 2B model and 17.4% for the 7B model) over GPT-4V, without model scaling, merging, or architectural changes.
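To make the 'verbalizer' idea concrete, here is a minimal PyTorch sketch, assuming each verbalizer is a small projection head that maps one layer's hidden states into the model's vocabulary (natural-language) space. The class name, shapes, and layer choices are illustrative assumptions, not the authors' actual implementation.

```python
import torch
import torch.nn as nn

class Verbalizer(nn.Module):
    """Hypothetical verbalizer: projects one layer's hidden states into
    vocabulary space, so intermediate features can be read as token logits."""

    def __init__(self, hidden_dim: int, vocab_size: int):
        super().__init__()
        self.norm = nn.LayerNorm(hidden_dim)
        self.proj = nn.Linear(hidden_dim, vocab_size, bias=False)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        # hidden_states: (batch, seq_len, hidden_dim) from a single layer
        return self.proj(self.norm(hidden_states))  # (batch, seq_len, vocab_size)
```

In this reading, attaching such a head to every layer lets both the small and the large VLM 'speak' at each depth, which is what makes a layer-by-layer comparison possible.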

Why it matters?

This research is important because it makes advanced AI technology more accessible by allowing smaller models to perform nearly as well as larger ones. By improving the efficiency of VLMs, VLsI can be used in applications such as mobile apps, robots, and other devices where computing resources are limited. This could lead to more intelligent systems that can understand images and respond in natural language effectively in everyday situations.

Abstract

The recent surge in high-quality visual instruction tuning samples from closed-source vision-language models (VLMs) such as GPT-4V has accelerated the release of open-source VLMs across various model sizes. However, scaling VLMs to improve performance using larger models brings significant computational challenges, especially for deployment on resource-constrained devices like mobile platforms and robots. To address this, we propose VLsI: Verbalized Layers-to-Interactions, a new VLM family in 2B and 7B model sizes, which prioritizes efficiency without compromising accuracy. VLsI leverages a unique, layer-wise distillation process, introducing intermediate "verbalizers" that map features from each layer to natural language space, allowing smaller VLMs to flexibly align with the reasoning processes of larger VLMs. This approach mitigates the training instability often encountered in output imitation and goes beyond typical final-layer tuning by aligning the small VLMs' layer-wise progression with that of the large ones. We validate VLsI across ten challenging vision-language benchmarks, achieving notable performance gains (11.0% for 2B and 17.4% for 7B) over GPT-4V without the need for model scaling, merging, or architectural changes.
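As a rough illustration of the layer-wise alignment described in the abstract, the sketch below computes a KL-divergence loss between the verbalized (vocabulary-space) outputs of matched layers in the small and large VLM. The function name, temperature, and one-to-one layer pairing are assumptions for illustration; the paper's exact objective and layer-matching scheme may differ.

```python
import torch
import torch.nn.functional as F

def layerwise_distillation_loss(student_layer_logits, teacher_layer_logits,
                                temperature: float = 2.0) -> torch.Tensor:
    """Hypothetical layer-wise objective: average KL divergence between the
    verbalized outputs of matched student/teacher layers.

    Each list element has shape (batch, seq_len, vocab_size)."""
    losses = []
    for s_logits, t_logits in zip(student_layer_logits, teacher_layer_logits):
        s_log_probs = F.log_softmax(s_logits / temperature, dim=-1)
        with torch.no_grad():
            # the large (teacher) VLM is frozen; no gradients flow through it
            t_probs = F.softmax(t_logits / temperature, dim=-1)
        losses.append(F.kl_div(s_log_probs, t_probs, reduction="batchmean")
                      * temperature ** 2)
    return torch.stack(losses).mean()
```

The intent mirrored here is the abstract's point that aligning intermediate layers, rather than only the final output, gives the smaller model a more stable training signal than plain output imitation.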