xGen-MM (BLIP-3): A Family of Open Large Multimodal Models
Le Xue, Manli Shu, Anas Awadalla, Jun Wang, An Yan, Senthil Purushwalkam, Honglu Zhou, Viraj Prabhu, Yutong Dai, Michael S Ryoo, Shrikant Kendre, Jieyu Zhang, Can Qin, Shu Zhang, Chia-Chih Chen, Ning Yu, Juntao Tan, Tulika Manoj Awalgaonkar, Shelby Heinecke, Huan Wang, Yejin Choi, Ludwig Schmidt
2024-08-19

Summary
This paper introduces xGen-MM (also known as BLIP-3), a framework for developing large multimodal models that can take different types of data, such as text and images, as input and generate text about them.
What's the problem?
Building effective large multimodal models is challenging because they must integrate several types of information and perform well across many different tasks. Existing open models often underperform relative to their size in understanding or generating content, and they can exhibit safety issues such as hallucinations.
What's the solution?
The authors developed xGen-MM, which comprises carefully curated datasets, a training recipe, and model architectures. Within this framework they trained multiple models, including a pre-trained base model with strong in-context learning abilities and an instruction-tuned model that performs competitively against open-source models of similar size. They also released a safety-tuned model, trained with direct preference optimization (DPO), to reduce harmful behaviors such as hallucinations and improve overall safety.
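Because the models are released openly, they can typically be loaded with standard Hugging Face tooling. The sketch below is a minimal, hedged example: the checkpoint name shown and the choice of processor classes are assumptions about the release, not details confirmed in this summary, so consult the Salesforce Hugging Face page for the exact identifiers.

```python
# Minimal sketch: loading an open xGen-MM checkpoint with Hugging Face transformers.
# The model id below is an assumption; check the Salesforce Hugging Face page
# for the exact released checkpoint names.
from transformers import AutoModelForVision2Seq, AutoTokenizer, AutoImageProcessor

model_id = "Salesforce/xgen-mm-phi3-mini-instruct-r-v1"  # assumed checkpoint name

model = AutoModelForVision2Seq.from_pretrained(model_id, trust_remote_code=True)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)
image_processor = AutoImageProcessor.from_pretrained(model_id, trust_remote_code=True)
```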
Why it matters?
This research is significant because it provides open tools and resources for advancing multimodal models, which are essential for applications like virtual assistants and content creation. By making their models, datasets, and fine-tuning code publicly available, the authors encourage further research and innovation in this area.
Abstract
This report introduces xGen-MM (also known as BLIP-3), a framework for developing Large Multimodal Models (LMMs). The framework comprises meticulously curated datasets, a training recipe, model architectures, and a resulting suite of LMMs. xGen-MM, short for xGen-MultiModal, expands the Salesforce xGen initiative on foundation AI models. Our models undergo rigorous evaluation across a range of tasks, including both single- and multi-image benchmarks. Our pre-trained base model exhibits strong in-context learning capabilities, and the instruction-tuned model demonstrates competitive performance among open-source LMMs with similar model sizes. In addition, we introduce a safety-tuned model with DPO, aiming to mitigate harmful behaviors such as hallucinations and improve safety. We open-source our models, curated large-scale datasets, and our fine-tuning codebase to facilitate further advancements in LMM research. Associated resources will be available on our project page.
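For readers unfamiliar with DPO (direct preference optimization), the snippet below sketches the standard DPO objective from Rafailov et al. (2023) that such safety tuning builds on. It is a generic illustration rather than the authors' actual training code; the function name and tensor arguments are hypothetical.

```python
# Generic sketch of the DPO objective (Rafailov et al., 2023), not BLIP-3's code.
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Push the policy to prefer chosen over rejected responses,
    measured relative to a frozen reference model."""
    # Implicit rewards are scaled log-probability ratios w.r.t. the reference model.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    # Maximize the margin between chosen and rejected rewards.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```

In the safety-tuning setting described in the abstract, the "chosen" responses would be safer or less hallucinatory completions and the "rejected" responses the harmful ones, but the specific preference data used by the authors is not detailed in this summary.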