UME-R1: Exploring Reasoning-Driven Generative Multimodal Embeddings

Zhibin Lan, Liqiang Niu, Fandong Meng, Jie Zhou, Jinsong Su

2025-11-04

Summary

This paper introduces a new way to create multimodal embeddings, which are compact numerical representations of information from different sources, such as images and text, using the power of large language models. It moves beyond simply *identifying* what's in an image or video to actually *reasoning* about it and generating more useful embeddings.

What's the problem?

Current multimodal large language models are good at figuring out *what* is in an image or video, but they struggle with deeper reasoning. They create 'discriminative' embeddings, meaning they're good at telling things apart, but not so good at understanding relationships or generating new insights. This limits their ability to perform complex tasks that require thinking things through.

What's the solution?

The researchers developed a framework called UME-R1. It trains the model in two stages: first, a cold-start supervised fine-tuning stage teaches the model to reason and to produce both discriminative and 'generative' embeddings (embeddings conditioned on the model's own generated reasoning). Second, reinforcement learning further improves the reasoning and the quality of these generative embeddings. Essentially, they're training the model not just to *see* but to *understand* and *create*.
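To make the discriminative-vs-generative distinction concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: the hash-based encoder and the function names are illustrative stand-ins for a real MLLM. The only point it demonstrates is the structural difference, a discriminative embedding is pooled directly from the input, while a generative embedding is produced after the model has written out its own reasoning about the input.

```python
import hashlib
import numpy as np

def _toy_vec(text, dim=8):
    # Deterministic toy encoder: hash the text into a unit vector.
    # (Stand-in for an MLLM's pooled hidden state.)
    h = hashlib.sha256(text.encode()).digest()
    v = np.frombuffer(h[:dim * 4], dtype=np.uint32).astype(float)
    return v / np.linalg.norm(v)

def discriminative_embedding(inputs):
    # Conventional path: embed the raw input directly, no reasoning.
    return _toy_vec(inputs)

def generative_embedding(inputs, reason):
    # Generative path: first generate reasoning text, then embed the
    # input together with that reasoning, so the embedding reflects
    # what the model concluded, not just what it saw.
    reasoning = reason(inputs)
    return _toy_vec(inputs + " || " + reasoning)

# Toy usage: the same input yields different embeddings depending on
# whether the (stubbed) reasoning step is included.
emb_d = discriminative_embedding("a cat on a mat")
emb_g = generative_embedding("a cat on a mat",
                             lambda s: "the image shows " + s)
```

Both training stages in the paper operate on the generative path: supervised fine-tuning teaches the model to produce useful reasoning before emitting the embedding, and reinforcement learning then optimizes that reasoning for downstream embedding quality.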

Why it matters?

This work is important because it shows that generative embeddings unlock a lot more potential in multimodal models. By enabling reasoning, the model performs significantly better on a wide range of tasks involving images, videos, and visual documents. It also suggests a way to make these models more interpretable and scalable, meaning they can handle more complex problems and be used in more applications.

Abstract

The remarkable success of multimodal large language models (MLLMs) has driven advances in multimodal embeddings, yet existing models remain inherently discriminative, limiting their ability to benefit from the reasoning-driven generation paradigm. In this work, we pioneer the exploration of generative embeddings, unifying embedding tasks within a generative paradigm. We propose UME-R1, a universal multimodal embedding framework consisting of a two-stage training strategy: a cold-start supervised fine-tuning stage equips the model with reasoning capabilities and enables it to generate both discriminative and generative embeddings; a subsequent reinforcement learning stage enhances reasoning and further optimizes generative embedding quality. This pioneering work reveals four key insights: 1) generative embeddings unlock substantial performance gains over conventional discriminative embeddings by leveraging the powerful generative reasoning capabilities of MLLMs; 2) discriminative and generative embeddings are complementary, and their combined oracle performance far exceeds that of either alone; 3) RL can effectively enhance generative embeddings, establishing a scalable optimization paradigm; 4) repeated sampling at inference boosts downstream task coverage (pass@k), highlighting the inference-time scalability potential of generative embeddings. Evaluated on the MMEB-V2 benchmark across 78 tasks spanning video, image, and visual documents, UME-R1 significantly outperforms conventional discriminative embedding models and offers a foundation for more interpretable, reasoning-driven generative multimodal embeddings. Our code, models, and datasets will be publicly available at https://github.com/XMUDeepLIT/UME-R1.
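The fourth insight reports coverage under repeated sampling as pass@k. The abstract does not spell out the estimator, but pass@k is conventionally computed with the unbiased formula from the code-generation literature (Chen et al., 2021): given n sampled attempts of which c succeed, the probability that at least one of k drawn attempts succeeds is 1 - C(n-c, k) / C(n, k). A small sketch of that standard estimator, offered here as the likely metric rather than the paper's confirmed one:

```python
from math import comb

def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimator: probability that at least one of k
    attempts, drawn without replacement from n total attempts of which
    c are correct, is correct."""
    if n - c < k:
        # Fewer than k incorrect attempts exist, so any draw of k
        # attempts must include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy usage: 2 attempts, 1 correct -> pass@1 is 0.5,
# while pass@2 is 1.0 (both attempts are drawn).
p1 = pass_at_k(2, 1, 1)
p2 = pass_at_k(2, 1, 2)
```

The point of the insight is that this quantity grows with the number of samples drawn at inference time, so generative embeddings can trade extra compute for better task coverage.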