e5-omni: Explicit Cross-modal Alignment for Omni-modal Embeddings

Haonan Chen, Sicheng Gao, Radu Timofte, Tetsuya Sakai, Zhicheng Dou

2026-01-13

Summary

This paper introduces a new method, called e5-omni, for creating better omni-modal embedding models. These models are designed to understand and compare different types of data like text, images, and audio by representing them all in a common mathematical space.

What's the problem?

Current omni-modal models often struggle because they lean heavily on the implicit alignment inherited from their image-text pretraining. This leads to a few issues: similarity scores have different 'sharpness' depending on the modality pair, so they aren't on a consistent scale and are hard to compare; mixed-modality training batches make many negative examples trivially easy, so they contribute little learning signal and improvement stalls; and the embeddings of different data types have mismatched statistics (means and covariances) in the shared space, which makes result rankings unstable.
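To see why inconsistent 'sharpness' is a problem, here is a small illustrative sketch. The two temperatures (0.05 and 0.5) are made-up values for illustration, not figures from the paper:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# The same relative similarities, but divided by different temperatures:
# e.g., image-text logits may be much 'peakier' than audio-text logits.
text_image_logits = np.array([0.9, 0.3, 0.1]) / 0.05  # sharp (low temperature)
text_audio_logits = np.array([0.9, 0.3, 0.1]) / 0.5   # flat (high temperature)

p_image = softmax(text_image_logits)
p_audio = softmax(text_audio_logits)

# The top match dominates for image-text but not for audio-text, even though
# the underlying rankings agree, so raw scores across modality pairs are not
# directly comparable without per-modality temperature calibration.
print(p_image[0], p_audio[0])  # ~1.000 vs ~0.665
```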

What's the solution?

The researchers developed e5-omni, a lightweight recipe that adapts existing image-text models with three key techniques. First, they calibrate a per-modality 'temperature' so that similarity scores are on a comparable scale across data types. Second, they steer training toward the most confusing negative examples while filtering out likely false negatives (items wrongly treated as non-matches). Finally, they apply 'batch whitening' to match the means and covariances of the different data types within the shared space, making rankings more stable.
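The third technique can be sketched as follows: whitening forces each modality's batch of embeddings toward zero mean and identity covariance, so their first- and second-order statistics match. The `shrinkage` knob here is a hypothetical stand-in for the paper's covariance regularization, not its exact formulation:

```python
import numpy as np

def batch_whiten(X, eps=1e-5, shrinkage=0.1):
    """Whiten a batch of embeddings (rows) to zero mean, ~identity covariance.
    `shrinkage` pulls the covariance estimate toward the identity for stability
    (a hypothetical knob standing in for the paper's covariance regularization)."""
    Xc = X - X.mean(axis=0, keepdims=True)            # match first-order stats
    cov = (Xc.T @ Xc) / (len(X) - 1)
    cov = (1 - shrinkage) * cov + shrinkage * np.eye(cov.shape[0])
    # inverse square root of the covariance via eigendecomposition
    vals, vecs = np.linalg.eigh(cov)
    W = vecs @ np.diag(1.0 / np.sqrt(vals + eps)) @ vecs.T
    return Xc @ W                                      # match second-order stats

rng = np.random.default_rng(0)
# Two toy 'modalities' with very different means and spreads:
img = batch_whiten(rng.normal(2.0, 3.0, size=(256, 8)))
aud = batch_whiten(rng.normal(-1.0, 0.5, size=(256, 8)))
# After whitening, both share near-zero mean and near-identity covariance,
# so neither modality's scale dominates cross-modal comparisons.
```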

Why it matters?

This work is important because it makes omni-modal models more accurate and reliable. This is crucial for applications where you need to compare and understand different types of information, like searching for something using both a text description and an image, or automatically captioning videos with relevant audio descriptions. The method is also flexible and can be used with different existing models, making it widely applicable.

Abstract

Modern information systems often involve different types of items, e.g., a text query, an image, a video clip, or an audio segment. This motivates omni-modal embedding models that map heterogeneous modalities into a shared space for direct comparison. However, most recent omni-modal embeddings still rely heavily on implicit alignment inherited from pretrained vision-language model (VLM) backbones. In practice, this causes three common issues: (i) similarity logits have modality-dependent sharpness, so scores are not on a consistent scale; (ii) in-batch negatives become less effective over time because mixed-modality batches create an imbalanced hardness distribution; as a result, many negatives quickly become trivial and contribute little gradient; and (iii) embeddings across modalities show mismatched first- and second-order statistics, which makes rankings less stable. To tackle these problems, we propose e5-omni, a lightweight explicit alignment recipe that adapts off-the-shelf VLMs into robust omni-modal embedding models. e5-omni combines three simple components: (1) modality-aware temperature calibration to align similarity scales, (2) a controllable negative curriculum with debiasing to focus on confusing negatives while reducing the impact of false negatives, and (3) batch whitening with covariance regularization to better match cross-modal geometry in the shared embedding space. Experiments on MMEB-V2 and AudioCaps show consistent gains over strong bi-modal and omni-modal baselines, and the same recipe also transfers well to other VLM backbones. We release our model checkpoint at https://huggingface.co/Haon-Chen/e5-omni-7B.
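Component (2) of the recipe, the negative curriculum with debiasing, can be illustrated with a toy selection rule: prefer the most confusing in-batch negatives, but skip candidates whose similarity is suspiciously high, since they may be unlabeled true matches. The `upper` threshold and `k` here are hypothetical parameters, not the paper's actual debiasing mechanism:

```python
import numpy as np

def select_negatives(sim, pos_idx, k=2, upper=0.9):
    """Pick the k most confusing in-batch negatives for one query.
    Candidates scoring above `upper` are skipped as likely false negatives
    (a hypothetical threshold standing in for the paper's debiasing rule)."""
    sim = sim.copy()
    sim[pos_idx] = -np.inf        # never pick the labeled positive
    sim[sim > upper] = -np.inf    # drop suspected false negatives
    return np.argsort(-sim)[:k]  # hardest admissible negatives first

# Similarities of one query to a batch of 5 candidates; index 0 is the positive.
sims = np.array([0.95, 0.7, 0.2, 0.92, 0.6])
picked = select_negatives(sims, pos_idx=0, k=2)
print(picked)  # [1 4]: index 3 (0.92) is excluded as a likely false negative
```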