Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval
Fanheng Kong, Jingyuan Zhang, Yahui Liu, Hongzhi Zhang, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Victoria W., Fuzheng Zhang, Guorui Zhou
2025-05-28

Summary
This paper introduces UNITE, a framework for making AI better at finding and connecting information across different types of data, like text, images, and videos, by building universal representations (embeddings) that work well across all of them.
What's the problem?
The problem is that it is hard for AI to search and match information that comes in many forms, because each type of data, like words, pictures, or videos, has its own characteristics, and most models struggle to handle all of them within a single shared representation.
What's the solution?
To fix this, the researchers developed UNITE, which carefully selects and organizes training data across modalities and uses a training technique, called Modal-Aware Masked Contrastive Learning, that accounts for the differences between data types. This helps the model learn universal embeddings, so related information can be found no matter what format it comes in; a rough sketch of the idea is shown below.
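To make the training idea concrete, here is a minimal, hypothetical PyTorch sketch of a modality-aware masked contrastive loss. It illustrates the general technique rather than the paper's actual implementation: in-batch negatives whose target modality differs from the modality of the query's own positive target are masked out before the softmax, so candidates from different modalities do not compete with each other. The function name, the temperature value, and the integer modality ids are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def modal_aware_masked_infonce(query_emb, target_emb, target_modality, temperature=0.05):
    """Illustrative modality-aware masked contrastive (InfoNCE) loss.

    query_emb:        (B, D) query embeddings
    target_emb:       (B, D) embeddings of each query's paired (positive) target
    target_modality:  (B,)   integer modality id of each target (e.g. 0=text, 1=image, 2=video)
    """
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)

    # Cosine-similarity logits between every query and every in-batch target.
    logits = q @ t.T / temperature                                               # (B, B)

    # Keep a candidate only if its modality matches the modality of the
    # query's own positive target; mask out cross-modality negatives.
    same_modality = target_modality.unsqueeze(0) == target_modality.unsqueeze(1) # (B, B)
    logits = logits.masked_fill(~same_modality, float("-inf"))

    # Diagonal entries are the positives; standard cross-entropy over masked logits.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```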
Why it matters?
This matters because it allows AI to search, match, and organize information from all kinds of sources much more effectively, which is useful for things like online search engines, digital assistants, and any technology that deals with large amounts of mixed data.
Abstract
UNITE addresses challenges in multimodal information retrieval through systematic data curation and modality-aware training, using Modal-Aware Masked Contrastive Learning to achieve state-of-the-art results across retrieval benchmarks.