Modality Curation: Building Universal Embeddings for Advanced Multimodal Information Retrieval
Fanheng Kong, Jingyuan Zhang, Yahui Liu, Hongzhi Zhang, Shi Feng, Xiaocui Yang, Daling Wang, Yu Tian, Victoria W., Fuzheng Zhang, Guorui Zhou
2025-05-28

Summary
This paper introduces UNITE, a framework for making AI better at finding and connecting information across different types of data, like text, images, and videos, by building universal representations (embeddings) that work well across all of them.
What's the problem?
The problem is that it is hard for AI to search and match information that comes in many forms, because each type of data, like words, pictures, or videos, has its own characteristics, and most models struggle to handle all of them within a single shared representation.
What's the solution?
To fix this, the researchers developed UNITE, which carefully selects and organizes training data across modalities and uses a training technique, called Modal-Aware Masked Contrastive Learning, that accounts for the differences between data types. This helps the model learn universal embeddings, so related information can be found no matter what format it comes in; a rough sketch of the idea is shown below.
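To make the training idea concrete, here is a minimal, hypothetical PyTorch sketch of a modality-aware masked contrastive loss. It illustrates the general technique rather than the paper's actual implementation: in-batch negatives whose target modality differs from the modality of the query's own positive target are masked out before the softmax, so candidates from different modalities do not compete with each other. The function name, the temperature value, and the integer modality ids are assumptions made for this example.

```python
import torch
import torch.nn.functional as F

def modal_aware_masked_infonce(query_emb, target_emb, target_modality, temperature=0.05):
    """Illustrative modality-aware masked contrastive (InfoNCE) loss.

    query_emb:        (B, D) query embeddings
    target_emb:       (B, D) embeddings of each query's paired (positive) target
    target_modality:  (B,)   integer modality id of each target (e.g. 0=text, 1=image, 2=video)
    """
    q = F.normalize(query_emb, dim=-1)
    t = F.normalize(target_emb, dim=-1)

    # Cosine-similarity logits between every query and every in-batch target.
    logits = q @ t.T / temperature                                               # (B, B)

    # Keep a candidate only if its modality matches the modality of the
    # query's own positive target; mask out cross-modality negatives.
    same_modality = target_modality.unsqueeze(0) == target_modality.unsqueeze(1) # (B, B)
    logits = logits.masked_fill(~same_modality, float("-inf"))

    # Diagonal entries are the positives; standard cross-entropy over masked logits.
    labels = torch.arange(q.size(0), device=q.device)
    return F.cross_entropy(logits, labels)
```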
Why it matters?
This matters because it allows AI to search, match, and organize information from all kinds of sources much more effectively, which is useful for things like online search engines, digital assistants, and any technology that deals with large amounts of mixed data.
Abstract
UNITE addresses challenges in multimodal information retrieval through systematic data curation and modality-aware training, using Modal-Aware Masked Contrastive Learning to achieve state-of-the-art results across retrieval benchmarks.