Better Together: Leveraging Unpaired Multimodal Data for Stronger Unimodal Models
Sharut Gupta, Shobhita Sundaram, Chenyu Wang, Stefanie Jegelka, Phillip Isola
2025-10-13
Summary
This paper explores a new way to train AI models that understand different types of data, like images and text, without needing perfectly matched examples of both. It proposes a method that improves how these models learn by training a single shared model on unpaired data drawn from multiple modalities.
What's the problem?
Most AI systems that handle multiple types of data, such as answering questions about images, require large datasets where images are directly paired with corresponding text questions and answers. Getting these paired datasets is expensive and time-consuming. The paper points out that we often have lots of data in different formats that *aren't* directly linked, and it asks if we can still use that data to improve the AI's understanding.
What's the solution?
The researchers developed a technique called UML, or Unpaired Multimodal Learner. Essentially, they created a single model that alternately processes data from different sources – images, text, audio – but importantly, it uses the same underlying 'brain' (shared parameters) for all of them. This allows the model to learn connections between these different types of data even if they weren't originally paired. They also showed mathematically (under linear data-generating assumptions) that training with this unpaired data can yield representations that capture strictly more information about the underlying data than training on one modality alone.
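To make the idea concrete, here is a minimal, hypothetical sketch of the shared-parameter setup described above. All names and dimensions are illustrative assumptions, not the paper's actual implementation: each modality gets its own small input projection into a common space, while a single shared trunk (the shared 'brain') processes batches from every modality in turn.

```python
import random

def make_linear(n_in, n_out, seed):
    # A toy linear layer: a weight matrix with n_out rows of n_in values.
    rng = random.Random(seed)
    return [[rng.uniform(-0.1, 0.1) for _ in range(n_in)] for _ in range(n_out)]

def apply(weights, x):
    # Matrix-vector product: maps an n_in vector to an n_out vector.
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]

# Modality-specific projections map raw features into a common 4-dim space.
# (The feature sizes 8 and 5 are arbitrary placeholders.)
projections = {
    "image": make_linear(8, 4, seed=0),
    "text":  make_linear(5, 4, seed=1),
}

# One trunk, shared across all modalities -- this is the key design choice.
shared_trunk = make_linear(4, 4, seed=2)

def encode(modality, x):
    # Route the input through its own projection, then the shared trunk.
    return apply(shared_trunk, apply(projections[modality], x))

# Alternate over unpaired batches: the image and text inputs here are
# unrelated, yet both update/exercise the same shared parameters.
unpaired_batches = [("image", [1.0] * 8), ("text", [1.0] * 5)]
reps = [encode(m, x) for m, x in unpaired_batches]
```

In a real system the projections would be learned tokenizers or encoders and the trunk a transformer trained with gradient descent; the sketch only shows the routing pattern, where unpaired inputs from different modalities flow through one set of shared weights.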
Why it matters?
This research is important because it opens up possibilities for training more powerful AI models with less reliance on expensive, manually paired datasets. By leveraging readily available, unpaired data, we can improve performance on tasks like image recognition and audio processing, making AI more accessible and adaptable to real-world scenarios where perfectly matched data is rare.
Abstract
Traditional multimodal learners find unified representations for tasks like visual question answering, but rely heavily on paired datasets. However, an overlooked yet potentially powerful question is: can one leverage auxiliary unpaired multimodal data to directly enhance representation learning in a target modality? We introduce UML: Unpaired Multimodal Learner, a modality-agnostic training paradigm in which a single model alternately processes inputs from different modalities while sharing parameters across them. This design exploits the assumption that different modalities are projections of a shared underlying reality, allowing the model to benefit from cross-modal structure without requiring explicit pairs. Theoretically, under linear data-generating assumptions, we show that unpaired auxiliary data can yield representations strictly more informative about the data-generating process than unimodal training. Empirically, we show that using unpaired data from auxiliary modalities -- such as text, audio, or images -- consistently improves downstream performance across diverse unimodal targets such as image and audio. Our project page: https://unpaired-multimodal.github.io/