
Explore the Limits of Omni-modal Pretraining at Scale

Yiyuan Zhang, Handong Li, Jing Liu, Xiangyu Yue

2024-06-14

Summary

This paper introduces Multimodal Context (MiCo), a new pretraining approach aimed at 'omni-modal intelligence': AI models that can understand and process many different types of data, such as text, images, and videos, all at once.

What's the problem?

Most existing AI models can understand only one type of input at a time, such as text alone or images alone. This limits their usefulness in real-world situations where information comes from multiple sources, so there is a need for models that can learn from and integrate different types of data to improve their performance and versatility.

What's the solution?

The authors propose the MiCo framework, which scales up pretraining across many modalities, large amounts of data, and model parameters so that the models learn universal representations that transfer across tasks. The pretrained models are evaluated on single-modality perception benchmarks covering 10 modalities, on 25 cross-modality understanding tasks such as retrieval, question answering (for example, answering questions that combine text and images), and captioning, and on 18 multimodal large language model benchmarks, where they set 37 new state-of-the-art records. A rough sketch of the general idea appears below.
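To make the idea of "universal representations across modalities" concrete, here is a minimal, illustrative sketch in PyTorch. It is not the authors' MiCo implementation: the encoder architecture, dimensions, and contrastive alignment loss below are assumptions chosen for clarity; the paper's actual pretraining objective and model design may differ. The sketch only shows the general pattern of mapping several modalities into one shared embedding space and pulling paired samples together.

```python
# Illustrative sketch only: NOT the authors' MiCo implementation.
# Shows the general idea of projecting different modalities into a
# shared embedding space and aligning paired samples during pretraining.
import torch
import torch.nn as nn
import torch.nn.functional as F


class ModalityEncoder(nn.Module):
    """Hypothetical per-modality encoder that projects raw features
    into a shared d-dimensional representation space."""

    def __init__(self, input_dim: int, shared_dim: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, 512),
            nn.GELU(),
            nn.Linear(512, shared_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so embeddings from all modalities live on the same sphere.
        return F.normalize(self.net(x), dim=-1)


def alignment_loss(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss that pulls paired embeddings from two
    modalities together; a common choice for this kind of pretraining,
    used here purely as a stand-in for the paper's objective."""
    logits = a @ b.t() / temperature
    targets = torch.arange(a.size(0))
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))


# Toy usage with made-up feature dimensions for paired text and image inputs.
text_enc, image_enc = ModalityEncoder(input_dim=300), ModalityEncoder(input_dim=768)
text_feats = torch.randn(8, 300)    # batch of 8 paired samples
image_feats = torch.randn(8, 768)
loss = alignment_loss(text_enc(text_feats), image_enc(image_feats))
loss.backward()
```

The same pattern would extend to more modalities by adding further encoders that share the output space, which is the scaling dimension the paper focuses on.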

Why it matters?

This research is significant because it pushes the boundaries of what AI can do by enabling it to understand and process multiple types of information simultaneously. By developing omni-modal intelligence, we can create more powerful AI systems that are better equipped to handle complex tasks in fields like healthcare, robotics, and education. The availability of the code and models encourages further research and development in this area.

Abstract

We propose to build omni-modal intelligence, which is capable of understanding any modality and learning universal representations. Specifically, we propose a scalable pretraining paradigm, named Multimodal Context (MiCo), which can scale up the number of modalities and the amount of data, together with the model parameters, in the pretraining process. With MiCo, the pretrained models show significant emergent abilities in multimodal learning, which are evaluated on the following tasks: i) single-modality perception benchmarks of 10 different modalities, ii) 25 cross-modality understanding tasks of retrieval, question-answering, and captioning, and iii) 18 multimodal large language model benchmarks. Our models establish 37 new records for state-of-the-art performance. We hope that our research could contribute to the development of omni-modal intelligence. Code and models are available at https://github.com/invictus717/MiCo