
RealSyn: An Effective and Scalable Multimodal Interleaved Document Transformation Paradigm

Tiancheng Gu, Kaicheng Yang, Chaoyi Zhang, Yin Xie, Xiang An, Ziyong Feng, Dongnan Liu, Weidong Cai, Jiankang Deng

2025-02-19


Summary

This paper introduces RealSyn, a new way to teach AI systems to better understand the relationship between images and text by combining real-world documents with artificially generated text descriptions.

What's the problem?

Current AI models that work with both images and text are trained on paired data, where each image has a matching text description. However, a large amount of information sits in interleaved documents, where images and text appear together but aren't directly paired, and this data goes unused. That means valuable training signal that could help AI systems understand the world better is being left on the table.

What's the solution?

The researchers created RealSyn, which addresses this problem in several steps. First, it extracts high-quality images and text from real-world documents. Then it uses a hierarchical retrieval method to match each image with semantically relevant text descriptions. To make the data even more useful, it generates additional synthetic text descriptions for the images. Finally, it applies a semantic balance sampling strategy to ensure the dataset covers a wide variety of concepts, including rare (long-tail) ones. The result is three versions of the RealSyn dataset, with 15 million, 30 million, and 100 million image-text pairs.
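To make the two core steps more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the paper's hierarchical retrieval operates on embeddings from real encoder models, while this toy version uses hand-made vectors, and the sampling here is simple inverse-frequency weighting, assumed as a rough stand-in for the paper's semantic balance sampling. The function names (`retrieve_texts`, `balanced_sample`) and the `concept_of` mapping are illustrative inventions.

```python
import math
import random
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def retrieve_texts(image_emb, text_embs, k=2):
    """Rank candidate texts by similarity to the image and keep the top-k.

    Stands in for the paper's hierarchical retrieval step, which
    associates each extracted image with multiple relevant texts.
    """
    order = sorted(range(len(text_embs)),
                   key=lambda i: cosine(image_emb, text_embs[i]),
                   reverse=True)
    return order[:k]

def balanced_sample(pairs, concept_of, n, seed=0):
    """Sample image-text pairs with weight inversely proportional to
    concept frequency, so rare (long-tail) concepts are not drowned
    out by common ones."""
    counts = Counter(concept_of[p] for p in pairs)
    weights = [1.0 / counts[concept_of[p]] for p in pairs]
    rng = random.Random(seed)
    return rng.choices(pairs, weights=weights, k=n)

# Toy example: one image embedding, three candidate text embeddings.
image = [1.0, 0.0]
texts = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1]]
print(retrieve_texts(image, texts, k=2))  # indices of the two closest texts

# Toy example: "okapi" is rare, so it gets a higher per-pair weight.
pairs = ["img_a", "img_b", "img_c", "img_d"]
concept_of = {"img_a": "dog", "img_b": "dog", "img_c": "dog", "img_d": "okapi"}
print(balanced_sample(pairs, concept_of, n=4))
```

In the real pipeline the similarity scores would come from pretrained vision and text encoders, and balancing would operate over clustered semantic concepts rather than a hand-written label map, but the structure is the same: retrieve relevant realistic texts per image, then rebalance the resulting pairs before pre-training.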

Why it matters?

This matters because it could significantly improve AI systems that work with both images and text. These improvements could lead to better performance in tasks like image search, automatic image captioning, and visual question answering. By making use of previously untapped data and creating more diverse datasets, RealSyn could help AI systems understand and interact with visual and textual information in more human-like ways. This could have wide-ranging applications in fields like education, accessibility, and information retrieval.

Abstract

After pre-training on extensive image-text pairs, Contrastive Language-Image Pre-training (CLIP) demonstrates promising performance on a wide variety of benchmarks. However, a substantial volume of non-paired data, such as multimodal interleaved documents, remains underutilized for vision-language representation learning. To fully leverage these unpaired documents, we initially establish a Real-World Data Extraction pipeline to extract high-quality images and texts. Then we design a hierarchical retrieval method to efficiently associate each image with multiple semantically relevant realistic texts. To further enhance fine-grained visual information, we propose an image semantic augmented generation module for synthetic text production. Furthermore, we employ a semantic balance sampling strategy to improve dataset diversity, enabling better learning of long-tail concepts. Based on these innovations, we construct RealSyn, a dataset combining realistic and synthetic texts, available in three scales: 15M, 30M, and 100M. Extensive experiments demonstrate that RealSyn effectively advances vision-language representation learning and exhibits strong scalability. Models pre-trained on RealSyn achieve state-of-the-art performance on multiple downstream tasks. To facilitate future research, the RealSyn dataset and pre-trained model weights are released at https://github.com/deepglint/RealSyn.