OmniCorpus: A Unified Multimodal Corpus of 10 Billion-Level Images Interleaved with Text
Qingyun Li, Zhe Chen, Weiyun Wang, Wenhai Wang, Shenglong Ye, Zhenjiang Jin, Guanzhou Chen, Yinan He, Zhangwei Gao, Erfei Cui, Jiashuo Yu, Hao Tian, Jiasheng Zhou, Chao Xu, Bin Wang, Xingjian Wei, Wei Li, Wenjian Zhang, Bo Zhang, Pinlong Cai, Licheng Wen, Xiangchao Yan
2024-06-17

Summary
This paper introduces OmniCorpus, a massive dataset that interleaves images and text to improve how AI models understand and generate multimodal content. At a 10 billion scale, it contains 8.6 billion images and 1,696 billion text tokens, arranged in document form that mimics how people naturally read and process information.
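To make the interleaved format concrete, here is a minimal sketch of what a single document record could look like. The schema below is a hypothetical illustration (loosely following the convention used by related datasets such as MMC4 and OBELICS, where aligned lists hold text segments and image references in reading order); it is not the official OmniCorpus format.

```python
from dataclasses import dataclass, field

@dataclass
class InterleavedDocument:
    """Hypothetical record: texts and images are aligned lists in reading order.

    At each position exactly one of texts[i] / images[i] is set; the other is None.
    """
    url: str                                    # source web page
    texts: list = field(default_factory=list)   # text segment, or None where an image sits
    images: list = field(default_factory=list)  # image URL, or None where a text segment sits

doc = InterleavedDocument(
    url="https://example.com/article",
    texts=["Opening paragraph ...", None, "Paragraph following the figure ..."],
    images=[None, "https://example.com/figure1.jpg", None],
)
```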
What's the problem?
Datasets used to train multimodal AI models often lack the scale and diversity needed for effective learning. Many existing image-text interleaved datasets are too small or focus mainly on English content, which limits their usefulness for developing multimodal large language models that can understand both images and text. This restriction makes it harder for models to learn from the rich variety of information available on the internet.
What's the solution?
To solve this problem, the authors created OmniCorpus, which is significantly larger than previous interleaved datasets and draws on more diverse sources, including non-English websites and video-centric websites alongside English ones. The interleaved structure is also flexible: it can easily be reduced to a pure-text corpus or to image-text pairs, suiting different types of AI tasks (see the sketch below). The authors used an efficient data processing engine to filter and extract high-quality documents while maintaining this large scale.
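As a hedged sketch of how the "easily degradable" property could work in practice, the helpers below reduce the hypothetical InterleavedDocument record from the earlier sketch to a text-only sample or to image-text pairs. The pairing heuristic (nearest following text segment) is an assumption for illustration, not the authors' exact procedure.

```python
def to_pure_text(doc: InterleavedDocument) -> str:
    """Drop image slots and join the remaining text segments into a text-only sample."""
    return "\n".join(t for t in doc.texts if t)

def to_image_text_pairs(doc: InterleavedDocument) -> list:
    """Pair each image with the nearest following text segment, if one exists."""
    pairs = []
    for i, image_url in enumerate(doc.images):
        if image_url is None:
            continue
        caption = next((t for t in doc.texts[i + 1:] if t), None)  # look ahead for text
        if caption is not None:
            pairs.append((image_url, caption))
    return pairs

print(to_pure_text(doc))           # text-only corpus entry
print(to_image_text_pairs(doc))    # [(image URL, nearby text)]
```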
Why it matters?
This research is important because it provides a solid foundation for future developments in multimodal AI. By offering a comprehensive dataset that reflects real-world data more accurately, OmniCorpus can help improve the performance of AI models in understanding complex information that combines text and visuals. This advancement could lead to better applications in areas like education, content creation, and data analysis.
Abstract
Image-text interleaved data, consisting of multiple images and texts arranged in a natural document format, aligns with the presentation paradigm of internet data and closely resembles human reading habits. Recent studies have shown that such data aids multimodal in-context learning and maintains the capabilities of large language models during multimodal fine-tuning. However, the limited scale and diversity of current image-text interleaved data restrict the development of multimodal large language models. In this paper, we introduce OmniCorpus, a 10 billion-scale image-text interleaved dataset. Using an efficient data engine, we filter and extract large-scale high-quality documents, which contain 8.6 billion images and 1,696 billion text tokens. Compared to counterparts (e.g., MMC4, OBELICS), our dataset 1) has 15 times larger scales while maintaining good data quality; 2) features more diverse sources, including both English and non-English websites as well as video-centric websites; 3) is more flexible, easily degradable from an image-text interleaved format to pure text corpus and image-text pairs. Through comprehensive analysis and experiments, we validate the quality, usability, and effectiveness of the proposed dataset. We hope this could provide a solid data foundation for future multimodal model research. Code and data are released at https://github.com/OpenGVLab/OmniCorpus.