Chimera: Improving Generalist Model with Domain-Specific Experts
Tianshuo Peng, Mingsheng Li, Hongbin Zhou, Renqiu Xia, Renrui Zhang, Lei Bai, Song Mao, Bin Wang, Conghui He, Aojun Zhou, Botian Shi, Tao Chen, Bo Zhang, Xiangyu Yue
2024-12-11

Summary
This paper presents Chimera, a system that enhances generalist large multimodal models (LMMs) by integrating domain-specific expert models for better performance on specialized tasks.
What's the problem?
Generalist models are trained on a wide range of data but often struggle with specialized tasks that require in-depth knowledge, like understanding complex charts or performing advanced math. These models are usually trained on large datasets dominated by natural images, which means they may not perform well in specific domains where expert knowledge is necessary. Additionally, combining generalist models with specialized expert models can be difficult due to differences in how they represent information.
What's the solution?
The authors introduce Chimera, which combines the strengths of generalist LMMs with domain-specific experts. They develop a progressive training strategy that integrates features from expert models into the generalist model's input. To balance optimization between the two (the generalist's visual encoder is already well aligned, so it would otherwise dominate training), they propose a Generalist-Specialist Collaboration Masking (GSCM) mechanism that helps align the models' representations. The result is a versatile model that performs well across the chart, table, math, and document domains, achieving state-of-the-art results on multi-modal reasoning and visual content extraction tasks.
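To make the idea concrete, here is a minimal sketch of how expert features might be fused into a generalist model's input under a GSCM-style mask. This is an illustrative simplification, not the paper's actual implementation: the function names (`project_expert`, `gscm_fuse`), the random token-level masking, and the use of NumPy in place of a real training framework are all assumptions for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def project_expert(expert_feats: np.ndarray, proj: np.ndarray) -> np.ndarray:
    """Map expert features into the generalist embedding space.

    `proj` stands in for a learned projection layer (hypothetical).
    """
    return expert_feats @ proj

def gscm_fuse(general_feats: np.ndarray,
              expert_feats: np.ndarray,
              mask_ratio: float = 0.5,
              rng: np.random.Generator = rng):
    """Illustrative GSCM-style fusion (a sketch, not the paper's method).

    Randomly masks a fraction of generalist token positions and fills
    them with expert-derived features, so that gradients are forced
    through the expert branch instead of flowing only through the
    already well-aligned general visual encoder.
    """
    n_tokens, _ = general_feats.shape
    mask = rng.random(n_tokens) < mask_ratio   # True -> take expert feature
    fused = np.where(mask[:, None], expert_feats, general_feats)
    return fused, mask

# Toy usage: 8 visual tokens, 16-dim generalist space, 32-dim expert space.
general = rng.standard_normal((8, 16))
expert_raw = rng.standard_normal((8, 32))
proj = rng.standard_normal((32, 16)) / np.sqrt(32)  # stand-in projection
expert = project_expert(expert_raw, proj)
fused, mask = gscm_fuse(general, expert, mask_ratio=0.5)
```

In a real pipeline the projection would be learned and the fused token sequence would be fed to the LMM's language backbone during the progressive training stages; the sketch only shows the masking-and-substitution pattern.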
Why it matters?
This research is important because it shows how we can improve AI models by making them more adaptable and capable of handling specialized tasks. By integrating expert knowledge into generalist models, Chimera enhances their performance significantly, paving the way for better applications in fields like data analysis, education, and any area where understanding complex visual information is crucial.
Abstract
Recent advancements in Large Multi-modal Models (LMMs) underscore the importance of scaling by increasing image-text paired data, achieving impressive performance on general tasks. Despite their effectiveness in broad applications, generalist models are primarily trained on web-scale datasets dominated by natural images, resulting in the sacrifice of specialized capabilities for domain-specific tasks that require extensive domain prior knowledge. Moreover, directly integrating expert models tailored for specific domains is challenging due to the representational gap and imbalanced optimization between the generalist model and experts. To address these challenges, we introduce Chimera, a scalable and low-cost multi-modal pipeline designed to boost the ability of existing LMMs with domain-specific experts. Specifically, we design a progressive training strategy to integrate features from expert models into the input of a generalist LMM. To address the imbalanced optimization caused by the well-aligned general visual encoder, we introduce a novel Generalist-Specialist Collaboration Masking (GSCM) mechanism. This results in a versatile model that excels across the chart, table, math, and document domains, achieving state-of-the-art performance on multi-modal reasoning and visual content extraction tasks, both of which are challenging tasks for assessing existing LMMs.