TransAgent: Transfer Vision-Language Foundation Models with Heterogeneous Agent Collaboration
Yiwei Guo, Shaobin Zhuang, Kunchang Li, Yu Qiao, Yali Wang
2024-10-18

Summary
This paper introduces TransAgent, a framework that transfers knowledge from a diverse set of specialized expert models into vision-language models like CLIP, helping them generalize better to downstream visual recognition tasks.
What's the problem?
Vision-language models such as CLIP are powerful, but they often struggle to adapt when downstream data differ significantly from their pre-training data. A single model cannot cover every domain on its own, and the diverse knowledge held by other specialized expert models, pre-trained on different modalities, tasks, networks, and datasets, goes largely unused because those models have heterogeneous structures that are hard to integrate. This limits how effective CLIP-like models are in real-world applications.
What's the solution?
To address this, the authors developed TransAgent, which integrates knowledge from multiple specialized models (the 'isolated agents') into a unified framework. It uses multi-source knowledge distillation so the main model (e.g., CLIP) can learn from these diverse sources through lightweight low-shot tuning rather than extensive retraining. By collaborating with 11 heterogeneous agents, TransAgent improves the main model's ability to generalize across tasks while adding no extra computational cost at inference.
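As a rough illustration of what multi-source knowledge distillation can look like in practice, the sketch below mixes the predictions of several frozen agent models with a learnable gate and distills that mixture into a CLIP-like student. The class name MultiSourceDistiller, the gating scheme, and the temperature-scaled KL loss are assumptions for illustration, not the authors' exact design.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MultiSourceDistiller(nn.Module):
    """Gated multi-source knowledge distillation (illustrative sketch).

    Each frozen 'agent' model produces class logits for a batch; a learnable
    gate mixes them into a single soft teacher, and a temperature-scaled KL
    loss pulls the CLIP-like student toward that mixture.
    """

    def __init__(self, num_agents: int, temperature: float = 2.0):
        super().__init__()
        # One learnable mixing weight per agent, softmax-normalized in forward().
        self.gate = nn.Parameter(torch.zeros(num_agents))
        self.temperature = temperature

    def forward(self, student_logits, agent_logits):
        # student_logits: [batch, num_classes] from the tuned CLIP student.
        # agent_logits:   list of [batch, num_classes] tensors from frozen agents.
        weights = F.softmax(self.gate, dim=0)                    # [num_agents]
        stacked = torch.stack(agent_logits, dim=0)               # [num_agents, batch, classes]
        teacher = (weights[:, None, None] * stacked).sum(dim=0)  # gated teacher mixture
        t = self.temperature
        # The teacher side is detached: the agents are only read, never updated,
        # and they are dropped entirely at inference time.
        return F.kl_div(
            F.log_softmax(student_logits / t, dim=-1),
            F.softmax(teacher.detach() / t, dim=-1),
            reduction="batchmean",
        ) * (t * t)
```

In such a setup, this distillation loss would be added to the usual supervised loss on the few-shot labels during training; at inference only the tuned student runs, which is consistent with the paper's claim of no extra inference cost.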
Why it matters?
This research matters because it shows how a broad range of pre-trained knowledge can be pooled to make vision-language models more adaptable and efficient. By improving generalization without extra inference cost, TransAgent can benefit applications such as image recognition, video analysis, and other areas where understanding visual information is crucial, especially when labeled data are scarce or the target domain differs from the pre-training data.
Abstract
Vision-language foundation models (such as CLIP) have recently shown their power in transfer learning, owing to large-scale image-text pre-training. However, target domain data in the downstream tasks can be highly different from the pre-training phase, which makes it hard for such a single model to generalize well. Alternatively, there exists a wide range of expert models that contain diversified vision and/or language knowledge pre-trained on different modalities, tasks, networks, and datasets. Unfortunately, these models are "isolated agents" with heterogeneous structures, and how to integrate their knowledge for generalizing CLIP-like models has not been fully explored. To bridge this gap, we propose a general and concise TransAgent framework, which transports the knowledge of the isolated agents in a unified manner, and effectively guides CLIP to generalize with multi-source knowledge distillation. With such a distinct framework, we flexibly collaborate with 11 heterogeneous agents to empower vision-language foundation models, without further cost in the inference phase. Finally, our TransAgent achieves state-of-the-art performance on 11 visual recognition datasets. Under the same low-shot setting, it outperforms the popular CoOp by around 10% on average, and by 20% on EuroSAT, which contains large domain shifts.
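Putting the pieces together, one plausible form of the overall training objective (a hedged illustration, not the paper's exact formulation) combines a supervised loss on the low-shot labels with the gated distillation term:

$$
\mathcal{L} \;=\; \mathcal{L}_{\mathrm{CE}}\big(p_{\theta}(\cdot \mid x),\, y\big)
\;+\; \lambda\,\mathrm{KL}\!\left(\sum_{k=1}^{K} w_k\, q_k(\cdot \mid x)\;\middle\|\;p_{\theta}(\cdot \mid x)\right),
\qquad \sum_{k=1}^{K} w_k = 1,
$$

where \(p_\theta\) is the tuned CLIP student, \(q_k\) are the \(K = 11\) frozen agents, \(w_k\) are gating weights, and \(\lambda\) balances supervision against distillation; the gating weights, the balance term, and the KL direction are illustrative assumptions.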