Infinity-MM: Scaling Multimodal Performance with Large-Scale and High-Quality Instruction Data

Shuhao Gu, Jialing Zhang, Siyuan Zhou, Kevin Yu, Zhaohu Xing, Liangdong Wang, Zhou Cao, Jintao Jia, Zhuoyi Zhang, Yixuan Wang, Zhenchong Hu, Bo-Wen Zhang, Jijie Li, Dong Liang, Yingli Zhao, Yulong Ao, Yaoqi Liu, Fangxiang Feng, Guang Liu

2024-10-28

Summary

This paper presents Infinity-MM, a large-scale dataset of high-quality instructions designed to improve the performance of multimodal large language models (MLLMs).

What's the problem?

While multimodal language models have shown great potential, their performance is often limited by the small size and low quality of openly available instruction data. This makes it difficult for open-source models to compete with closed-source models trained on larger, better-curated data, and the lack of diverse, high-quality instructions limits how effectively they can learn.

What's the solution?

The authors introduce Infinity-MM, a massive dataset containing 40 million instruction samples that have been rigorously filtered and deduplicated for quality. They also developed a method for generating synthetic instructions with existing open-source VLMs, using detailed image annotations and diverse question generation to create more varied training examples. With this dataset, they trained a 2-billion-parameter model called Aquila-VL-2B, which achieved state-of-the-art performance compared to other models of similar size.
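
To make the idea of synthetic instruction generation concrete, here is a minimal sketch of what such a loop could look like. The paper does not release this exact code: the vlm interface and the helper names (generate_caption, generate_questions, build_samples) are hypothetical placeholders for whatever open-source VLM and prompts are actually used, and the quality filter and deduplication steps are simple stand-ins for the paper's more rigorous pipeline.

    # Hypothetical sketch of a synthetic instruction-generation pipeline.
    # The VLM interface and helper functions are illustrative placeholders,
    # not the authors' released code.

    def generate_caption(vlm, image):
        # Ask an open-source VLM for a detailed description of the image.
        return vlm.chat(image=image, prompt="Describe this image in detail.")

    def generate_questions(vlm, image, caption, n=3):
        # Use the detailed caption to prompt for diverse question types
        # (recognition, counting, reasoning, OCR, ...).
        prompt = (
            f"Based on this description: '{caption}', "
            f"write {n} diverse questions about the image, one per line."
        )
        return vlm.chat(image=image, prompt=prompt).splitlines()

    def build_samples(vlm, images, seen):
        # Turn raw images into (question, answer) instruction samples,
        # with a toy quality filter and deduplication step.
        samples = []
        for image in images:
            caption = generate_caption(vlm, image)
            for question in generate_questions(vlm, image, caption):
                answer = vlm.chat(image=image, prompt=question)
                key = (question, answer)
                if answer and key not in seen:   # stand-in quality/dedup check
                    seen.add(key)
                    samples.append({"image": image,
                                    "question": question,
                                    "answer": answer})
        return samples

The point of the sketch is the overall flow: detailed captions ground the question generation, the generated questions are answered by the same (or another) open-source model, and filtering plus deduplication keep only reasonably clean, non-redundant samples.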

Why it matters?

This research is significant because it demonstrates that increasing the quantity and quality of training data can greatly enhance the capabilities of open-source multimodal models. By providing better resources for training, Infinity-MM can help improve AI systems in various applications, making them more effective at understanding and generating multimodal content.

Abstract

Vision-Language Models (VLMs) have recently made significant progress, but the limited scale and quality of open-source instruction data hinder their performance compared to closed-source models. In this work, we address this limitation by introducing Infinity-MM, a large-scale multimodal instruction dataset with 40 million samples, enhanced through rigorous quality filtering and deduplication. We also propose a synthetic instruction generation method based on open-source VLMs, using detailed image annotations and diverse question generation. Using this data, we trained a 2-billion-parameter VLM, Aquila-VL-2B, achieving state-of-the-art (SOTA) performance for models of similar scale. This demonstrates that expanding instruction data and generating synthetic data can significantly improve the performance of open-source models.