CLIP-MoE: Towards Building Mixture of Experts for CLIP with Diversified Multiplet Upcycling
Jihai Zhang, Xiaoye Qu, Tong Zhu, Yu Cheng
2024-10-04

Summary
This paper presents CLIP-MoE, an approach that improves the performance of CLIP by using a technique called Diversified Multiplet Upcycling (DMU) to convert a dense CLIP model into a Mixture of Experts (MoE).
What's the problem?
The original CLIP model, which jointly encodes images and text, loses a substantial amount of information during encoding and tends to capture only coarse-grained features. This limits its ability to understand images rich in visual detail, making it less effective for fine-grained tasks.
What's the solution?
To solve this issue, the authors propose Diversified Multiplet Upcycling (DMU). Starting from a single pre-trained CLIP checkpoint, DMU fine-tunes several copies of the model so that each captures a different aspect of the input, while the copies share all of their parameters except the feed-forward network (FFN) layers (a sketch of this step appears below). The FFNs of these copies are then assembled into a sparsely activated Mixture of Experts (MoE) architecture, which routes inputs to specialized experts without a large increase in computation. This enhances the model's ability to process detailed images and improves its overall performance on tasks such as image classification and retrieval.
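To make the fine-tuning step concrete, the following is a minimal sketch, assuming a PyTorch-style CLIP module: each "multiplet" is a clone of the dense checkpoint in which only FFN parameters are left trainable, so that after fine-tuning the copies differ only in their FFNs. The helper name `make_ffn_finetuned_copy` and the `.mlp.` parameter-naming convention are illustrative assumptions, not details from the paper.

```python
# Minimal sketch (not the authors' code): produce one "multiplet" from a dense
# CLIP checkpoint by freezing everything except the FFN parameters.
import copy
import torch.nn as nn

def make_ffn_finetuned_copy(dense_clip: nn.Module, is_ffn_param) -> nn.Module:
    """Clone a pre-trained CLIP model and unfreeze only its FFN weights.

    `is_ffn_param` is a predicate on parameter names (e.g. ".mlp." appears in
    the name in many ViT-style CLIP implementations); both the helper and the
    naming rule are assumptions for illustration.
    """
    model = copy.deepcopy(dense_clip)
    for name, param in model.named_parameters():
        # Only FFN parameters receive gradients; all other weights stay
        # identical across copies and can later be shared in the MoE.
        param.requires_grad = is_ffn_param(name)
    return model

# Each copy would then be fine-tuned with its own objective or data split so
# that its FFN captures a different feature space, e.g.:
# multiplets = [make_ffn_finetuned_copy(clip, lambda n: ".mlp." in n) for _ in range(4)]
```

After fine-tuning, the only tensors that differ across the copies are the FFN weights, which is what keeps the later conversion into an MoE cheap in both parameters and compute.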
Why it matters?
This research is significant because it demonstrates how combining multiple specialized models can lead to better understanding and processing of complex visual information. By improving the capabilities of CLIP through DMU and MoE, this work can enhance applications in areas like computer vision, artificial intelligence, and multimodal learning systems, making them more efficient and effective.
Abstract
In recent years, Contrastive Language-Image Pre-training (CLIP) has become a cornerstone in multimodal intelligence. However, recent studies have identified that the information loss in the CLIP encoding process is substantial, and CLIP tends to capture only coarse-grained features from the input. This deficiency significantly limits the ability of a single CLIP model to handle images rich in visual detail. In this work, we propose a simple yet effective model-agnostic strategy, Diversified Multiplet Upcycling (DMU), for CLIP. DMU efficiently fine-tunes a series of CLIP models from a dense pre-trained CLIP checkpoint, each capturing a different feature space while sharing all parameters except the Feed-Forward Network (FFN). These models can then be transformed into a CLIP-MoE with a larger model capacity, leading to significantly enhanced performance with minimal computational overhead. To the best of our knowledge, Diversified Multiplet Upcycling is the first approach to introduce sparsely activated MoE into CLIP foundation models. Extensive experiments demonstrate the strong performance of CLIP-MoE across various zero-shot retrieval and zero-shot image classification tasks, as well as on downstream Multimodal Large Language Model (MLLM) benchmarks when serving as a vision encoder. Furthermore, Diversified Multiplet Upcycling enables the conversion of any dense CLIP model into CLIP-MoEs, which can seamlessly replace CLIP in a plug-and-play manner without requiring further adaptation in downstream frameworks. Through Diversified Multiplet Upcycling, we aim to provide valuable insights for future research on developing more efficient and effective multimodal learning systems.
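As a complement to the fine-tuning sketch above, here is a minimal, hypothetical sketch of how the differing FFNs could be assembled into a sparsely activated MoE layer with a token-level top-k router. The class name `UpcycledMoEFFN`, the linear router, and the dense (every-expert) evaluation in `forward` are simplifying assumptions for readability, not the paper's implementation.

```python
# Minimal sketch (illustrative, not the authors' code): assembling an MoE
# feed-forward layer from the FFNs of several fine-tuned CLIP copies.
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

class UpcycledMoEFFN(nn.Module):
    """Drop-in replacement for a dense transformer FFN.

    Each expert is the FFN taken from one fine-tuned CLIP copy; all other
    parameters of the transformer block remain shared across experts.
    """
    def __init__(self, expert_ffns, hidden_dim, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList([copy.deepcopy(f) for f in expert_ffns])
        self.router = nn.Linear(hidden_dim, len(self.experts))  # token-level gate
        self.top_k = top_k

    def forward(self, x):                        # x: (batch, seq, hidden_dim)
        gate_logits = self.router(x)             # (batch, seq, num_experts)
        weights, idx = gate_logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)     # renormalise over the top-k experts
        # Dense evaluation for readability: every expert processes every token.
        # A real sparse implementation dispatches only the routed tokens.
        expert_outputs = [expert(x) for expert in self.experts]
        out = torch.zeros_like(x)
        for k in range(self.top_k):
            for e, expert_out in enumerate(expert_outputs):
                mask = (idx[..., k] == e).unsqueeze(-1).to(x.dtype)
                out = out + mask * weights[..., k].unsqueeze(-1) * expert_out
        return out
```

A CLIP-MoE encoder would then swap each dense FFN in the transformer blocks for such a layer while keeping the attention and embedding weights from the original checkpoint, which is what allows it to serve as a plug-and-play replacement for the dense CLIP vision encoder.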