Qwen3-VL-Embedding and Qwen3-VL-Reranker: A Unified Framework for State-of-the-Art Multimodal Retrieval and Ranking
Mingxin Li, Yanzhao Zhang, Dingkun Long, Keqin Chen, Sibo Song, Shuai Bai, Zhibo Yang, Pengjun Xie, An Yang, Dayiheng Liu, Jingren Zhou, Junyang Lin
2026-01-12
Summary
This paper introduces a new set of AI models called Qwen3-VL-Embedding and Qwen3-VL-Reranker, which are designed to find relevant information across different types of data like text, images, and videos.
What's the problem?
Currently, searching for information across different types of data – like finding an image that matches a text description, or a video clip related to a document – is difficult because computers struggle to understand the meaning of these different formats in a unified way. Existing methods often aren't very accurate when trying to connect these different 'modalities' of information.
What's the solution?
The researchers created two models that work together. First, Qwen3-VL-Embedding converts text, images, videos, and documents into a numerical representation, kind of like a code, that captures their meaning. This model is trained in stages to get really good at understanding what things *mean*. Second, Qwen3-VL-Reranker takes a search query and a potential result and precisely determines how well they match, using a sophisticated method that pays attention to the relationships between the query and the result. These models can handle many languages and come in different sizes to fit different computing needs.
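The two-stage flow described above can be sketched in a few lines. This is a toy illustration, not the actual API: the real system would load the released Qwen3-VL-Embedding and Qwen3-VL-Reranker checkpoints, whereas here the embedder and the cross-encoder scorer are replaced by simple stand-in functions so the pipeline is runnable end to end.

```python
# Minimal sketch of retrieve-then-rerank. Both models are toy stand-ins:
# `embed` fakes Qwen3-VL-Embedding with a deterministic unit vector, and
# `rerank` fakes the Qwen3-VL-Reranker cross-encoder with token overlap.
import zlib
import numpy as np

def embed(text: str, dim: int = 64) -> np.ndarray:
    """Toy embedding: a deterministic, seed-hashed unit vector."""
    rng = np.random.default_rng(zlib.crc32(text.encode("utf-8")))
    v = rng.normal(size=dim)
    return v / np.linalg.norm(v)

def retrieve(query: str, docs: list[str], k: int = 2) -> list[str]:
    """Stage 1: fast candidate search by cosine similarity of embeddings."""
    q = embed(query)
    doc_matrix = np.stack([embed(d) for d in docs])
    scores = doc_matrix @ q  # cosine similarity (all vectors are unit-norm)
    top = np.argsort(-scores)[:k]
    return [docs[i] for i in top]

def rerank(query: str, candidates: list[str]) -> list[str]:
    """Stage 2: score each (query, candidate) pair jointly and re-sort.
    Here the joint score is just token overlap; the real reranker uses
    cross-attention over the full query-document pair."""
    q_tokens = set(query.lower().split())
    def score(doc: str) -> float:
        return len(q_tokens & set(doc.lower().split())) / max(len(q_tokens), 1)
    return sorted(candidates, key=score, reverse=True)

docs = ["a photo of a red apple",
        "city skyline at night",
        "green pear on a table"]
candidates = retrieve("red apple photo", docs, k=3)
ranked = rerank("red apple photo", candidates)
```

The design point this sketch captures is the division of labor: the embedding model lets every document be encoded once and searched cheaply, while the more expensive pairwise reranker is only run on the short candidate list.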
Why it matters?
These models significantly improve the accuracy of multimodal search, meaning you can find what you're looking for more easily when searching across different types of data. They’ve achieved top performance on standard tests, outperforming other models currently available, and have potential applications in areas like image search, visual question answering, and finding relevant video clips.
Abstract
In this report, we introduce the Qwen3-VL-Embedding and Qwen3-VL-Reranker model series, the latest extensions of the Qwen family built on the Qwen3-VL foundation model. Together, they provide an end-to-end pipeline for high-precision multimodal search by mapping diverse modalities, including text, images, document images, and video, into a unified representation space. The Qwen3-VL-Embedding model employs a multi-stage training paradigm, progressing from large-scale contrastive pre-training to reranking model distillation, to generate semantically rich, high-dimensional vectors. It supports Matryoshka Representation Learning, enabling flexible embedding dimensions, and handles inputs up to 32k tokens. Complementing this, Qwen3-VL-Reranker performs fine-grained relevance estimation for query-document pairs using a cross-encoder architecture with cross-attention mechanisms. Both model series inherit the multilingual capabilities of Qwen3-VL, supporting more than 30 languages, and are released in 2B and 8B parameter sizes to accommodate diverse deployment requirements. Empirical evaluations demonstrate that the Qwen3-VL-Embedding series achieves state-of-the-art results across diverse multimodal embedding evaluation benchmarks. Specifically, Qwen3-VL-Embedding-8B attains an overall score of 77.8 on MMEB-V2, ranking first among all models (as of January 8, 2026). This report presents the architecture, training methodology, and practical capabilities of the series, demonstrating their effectiveness on various multimodal retrieval tasks, including image-text retrieval, visual question answering, and video-text matching.
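The Matryoshka Representation Learning support mentioned in the abstract means the embedding is trained so that a prefix of the vector is itself a usable, lower-dimensional embedding. At inference time one can simply keep the first d coordinates and re-normalize. The snippet below is an illustrative sketch of that truncation step using a placeholder vector, not output from the actual model:

```python
import numpy as np

def truncate_embedding(emb: np.ndarray, dim: int) -> np.ndarray:
    """Matryoshka-style truncation: keep the first `dim` coordinates of an
    MRL-trained embedding and re-normalize to unit length."""
    v = emb[:dim].astype(np.float64)
    return v / np.linalg.norm(v)

# Placeholder for a full-dimensional embedding (a real one would come from
# the Qwen3-VL-Embedding model).
full = np.random.default_rng(0).normal(size=1024)
full /= np.linalg.norm(full)

small = truncate_embedding(full, 256)  # 4x cheaper to store and compare
```

Because the prefix is trained to carry most of the semantic signal, shorter vectors trade a small amount of retrieval accuracy for substantially lower storage and similarity-search cost.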