mmE5: Improving Multimodal Multilingual Embeddings via High-quality Synthetic Data

Haonan Chen, Liang Wang, Nan Yang, Yutao Zhu, Ziliang Zhao, Furu Wei, Zhicheng Dou

2025-02-14

Summary

This paper introduces mmE5, a new AI model that can understand both text and images in multiple languages. The researchers found a way to create high-quality synthetic (artificially generated) data to train this model, making it better at understanding different types of information across various languages.

What's the problem?

AI models that work with both text and images (called multimodal models) are becoming popular, but they need a lot of labeled data to work well. There isn't enough real-world labeled data available, so researchers have been generating synthetic data instead. However, the quality of this synthetic data hasn't been good enough to make the models truly effective.

What's the solution?

The researchers came up with three criteria for good synthetic data: it should cover many topics and types of information (broad scope), the text and images should match well (cross-modal alignment), and the data should look realistic (high fidelity). Guided by these criteria, they used a multimodal large language model to create a large synthetic dataset that includes different tasks, combines text and images in various ways, and covers multiple languages, with the model checking and improving its own outputs. They then used this high-quality synthetic data, together with existing labeled data, to train their new model, mmE5.
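The generate-then-check loop described above can be sketched in a few lines. This is a minimal, hypothetical illustration: the function names (`generate`, `evaluate`, `refine`), the quality threshold, and the toy stand-ins are assumptions for clarity, not the paper's actual pipeline or API.

```python
def synthesize_pair(caption, task, generate, evaluate, refine, max_rounds=3):
    """Generate a (query, target) pair, then self-evaluate and refine it
    until the quality score clears a threshold (hypothetical sketch)."""
    pair = generate(caption, task)
    for _ in range(max_rounds):
        if evaluate(pair) >= 0.9:  # accept high-fidelity pairs (assumed threshold)
            break
        pair = refine(pair)        # ask the model to improve its own output
    return pair

# Toy, deterministic stand-ins for the multimodal LLM calls:
def toy_generate(caption, task):
    return {"query": f"{task}: {caption}", "target": caption, "quality": 0.5}

def toy_evaluate(pair):
    return pair["quality"]

def toy_refine(pair):
    improved = dict(pair)
    improved["quality"] = min(1.0, pair["quality"] + 0.25)
    return improved

pair = synthesize_pair("a red bus in Berlin", "retrieval",
                       toy_generate, toy_evaluate, toy_refine)
```

In the real system, `generate`, `evaluate`, and `refine` would all be prompts to the same multimodal LLM within a single pass, conditioned on a real-world image rather than a caption string.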

Why it matters?

This matters because it could make AI systems much better at understanding and working with different types of information across many languages. For example, it could lead to better image search engines that work in any language, or AI assistants that can understand and describe images for people who speak different languages. This could make technology more accessible and useful for people around the world, regardless of what language they speak or what type of information they're working with.

Abstract

Multimodal embedding models have gained significant attention for their ability to map data from different modalities, such as text and images, into a unified representation space. However, the limited labeled multimodal data often hinders embedding performance. Recent approaches have leveraged data synthesis to address this problem, yet the quality of synthetic data remains a critical bottleneck. In this work, we identify three criteria for high-quality synthetic multimodal data. First, broad scope ensures that the generated data covers diverse tasks and modalities, making it applicable to various downstream scenarios. Second, robust cross-modal alignment makes different modalities semantically consistent. Third, high fidelity ensures that the synthetic data maintains realistic details to enhance its reliability. Guided by these principles, we synthesize datasets that: (1) cover a wide range of tasks, modality combinations, and languages, (2) are generated via a deep thinking process within a single pass of a multimodal large language model, and (3) incorporate real-world images with accurate and relevant texts, ensuring fidelity through self-evaluation and refinement. Leveraging these high-quality synthetic and labeled datasets, we train a multimodal multilingual E5 model mmE5. Extensive experiments demonstrate that mmE5 achieves state-of-the-art performance on the MMEB Benchmark and superior multilingual performance on the XTD benchmark. Our code, datasets, and models are released at https://github.com/haon-chen/mmE5.