Knowledge Transfer Across Modalities with Natural Language Supervision
Carlo Alberto Barbano, Luca Molinaro, Emanuele Aiello, Marco Grangetto
2024-11-26

Summary
This paper introduces a method called Knowledge Transfer that allows models to learn new concepts from their textual descriptions alone, similar to how humans can learn about something just by reading about it.
What's the problem?
Many AI models struggle to recognize concepts they have never seen examples of. This limits their ability to adapt: when they encounter something outside their training data, they cannot pick it up from a description alone, even when a clear one is available.
What's the solution?
The authors build on a pre-trained visual encoder that has already learned low-level features such as shape, appearance, and color. Given a textual description of a new concept, the method aligns these low-level visual features with the high-level description, so the model can learn and apply the concept without needing many examples. The approach works with different model architectures and improves performance on tasks such as image classification and image-text retrieval; a minimal sketch of the alignment idea follows below.
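To make the idea concrete, here is a minimal, hypothetical sketch of this kind of cross-modal alignment: a trainable vision tower is nudged so that its embeddings move toward the frozen text embedding of the concept's description. The encoder classes, feature dimensions, use of a small batch of unlabeled images, and single-step training loop are illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F

class VisionEncoder(torch.nn.Module):
    """Stand-in for a pre-trained visual backbone (trainable during transfer)."""
    def __init__(self, in_dim=768, dim=512):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, dim)

    def forward(self, feats):                # feats: (B, in_dim) pre-extracted image features
        return F.normalize(self.proj(feats), dim=-1)

class TextEncoder(torch.nn.Module):
    """Stand-in for a frozen text tower that embeds the concept description."""
    def __init__(self, in_dim=300, dim=512):
        super().__init__()
        self.proj = torch.nn.Linear(in_dim, dim)

    def forward(self, tokens):               # tokens: (1, in_dim) embedded description
        return F.normalize(self.proj(tokens), dim=-1)

def knowledge_transfer_step(vision, text_emb, images, optimizer):
    """One alignment step: pull visual embeddings toward the description embedding."""
    img_emb = vision(images)                               # (B, dim), L2-normalized
    loss = 1.0 - (img_emb * text_emb).sum(dim=-1).mean()   # mean cosine distance
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Toy usage with random tensors standing in for real inputs.
vision, text = VisionEncoder(), TextEncoder()
text_emb = text(torch.randn(1, 300)).detach()   # single textual description, kept fixed
opt = torch.optim.Adam(vision.parameters(), lr=1e-4)
loss = knowledge_transfer_step(vision, text_emb, torch.randn(8, 768), opt)
```

The key design choice this sketch tries to convey is that only the vision side is updated: the text embedding of the single description acts as a fixed target, so the existing low-level visual features are repurposed to represent the new high-level concept.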
Why it matters?
This research matters because it makes AI models more flexible and adaptable. By enabling them to learn from descriptions alone, it opens up new possibilities for multimodal applications such as image recognition and image-text retrieval, making these systems better at handling concepts they were never trained on.
Abstract
We present a way to learn novel concepts using only their textual description. We call this method Knowledge Transfer. Similarly to human perception, we leverage cross-modal interaction to introduce new concepts. We hypothesize that a pre-trained visual encoder already contains enough learned low-level features (e.g. shape, appearance, color) to describe previously unknown high-level concepts. Provided with a textual description of the novel concept, our method works by aligning the known low-level features of the visual encoder to its high-level textual description. We show that Knowledge Transfer can successfully introduce novel concepts into multimodal models in a very efficient manner, requiring only a single description of the target concept. Our approach is compatible both with separate textual and visual encoders (e.g. CLIP) and with parameters shared across modalities. We also show that, following the same principle, Knowledge Transfer can improve concepts already known to the model. Leveraging Knowledge Transfer, we improve zero-shot performance across different tasks such as classification, segmentation, image-text retrieval, and captioning.
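As a usage-side illustration (not code from the paper), once the two modalities are aligned, zero-shot classification with such a model reduces to standard CLIP-style cosine scoring between image embeddings and the text embeddings of class descriptions. The function and placeholder tensors below are assumptions for demonstration only.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_classify(image_embs, class_text_embs):
    """image_embs: (B, dim), class_text_embs: (C, dim); both assumed L2-normalized.
    Returns, for each image, the index of the best-matching class description."""
    logits = image_embs @ class_text_embs.t()   # cosine similarities, shape (B, C)
    return logits.argmax(dim=-1)

# Random placeholders standing in for real encoder outputs.
image_embs = F.normalize(torch.randn(4, 512), dim=-1)
class_text_embs = F.normalize(torch.randn(10, 512), dim=-1)
preds = zero_shot_classify(image_embs, class_text_embs)   # (4,) predicted class indices
```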