VLM2Vec-V2: Advancing Multimodal Embedding for Videos, Images, and Visual Documents
Rui Meng, Ziyan Jiang, Ye Liu, Mingyi Su, Xinyi Yang, Yuepeng Fu, Can Qin, Zeyuan Chen, Ran Xu, Caiming Xiong, Yingbo Zhou, Wenhu Chen, Semih Yavuz
2025-07-08
Summary
This paper introduces VLM2Vec-V2, a framework that produces unified embeddings for diverse types of visual information, including videos, images, and visual documents. By training on a diverse and comprehensive benchmark, it improves how AI models represent and relate these different visual formats.
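The summary does not state the training objective, but embedding models in this family are typically trained with an InfoNCE-style contrastive loss; under that assumption, a standard form is:

$$
\mathcal{L} = -\log \frac{\exp\!\big(\mathrm{sim}(q, c^{+})/\tau\big)}{\exp\!\big(\mathrm{sim}(q, c^{+})/\tau\big) + \sum_{c^{-}} \exp\!\big(\mathrm{sim}(q, c^{-})/\tau\big)}
$$

where $q$ is the instruction-conditioned query embedding, $c^{+}$ is the matching candidate, the $c^{-}$ are in-batch negatives, $\mathrm{sim}$ is cosine similarity, and $\tau$ is a temperature hyperparameter.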
What's the problem?
Existing multimodal embedding models focus mostly on natural images and handle other visual data types, such as videos and visual documents, poorly. This limits their usefulness in many real-world applications, such as video search, document retrieval, and AI agents.
What's the solution?
The researchers extended existing benchmarks with tasks covering videos and visual documents, then trained VLM2Vec-V2, a general-purpose embedding model that handles text, images, videos, and visual documents within a single framework. The model follows task instructions to produce embeddings that perform well across many tasks and data types.
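To make the retrieval setting concrete, here is a minimal, runnable sketch of instruction-conditioned embedding retrieval. The summary does not expose VLM2Vec-V2's actual API, so `DummyEncoder`, `embed`, and the instruction strings are all hypothetical stand-ins; only the pattern (one instruction-conditioned query vector scored against candidate vectors by cosine similarity) reflects the approach described above.

```python
import hashlib

import torch
import torch.nn.functional as F


def _stable_seed(*parts: str) -> int:
    """Deterministic 32-bit seed from strings (Python's hash() is salted)."""
    digest = hashlib.sha256("|".join(parts).encode()).digest()
    return int.from_bytes(digest[:4], "little")


class DummyEncoder:
    """Hypothetical stand-in for an instruction-following embedder like VLM2Vec-V2.

    A real model would run a vision-language backbone over the instruction
    plus the text/image/video/document input; here we hash the strings into
    a deterministic pseudo-embedding so the retrieval logic is runnable.
    """

    def __init__(self, dim: int = 128):
        self.dim = dim

    def embed(self, instruction: str, content: str) -> torch.Tensor:
        gen = torch.Generator().manual_seed(_stable_seed(instruction, content))
        vec = torch.randn(self.dim, generator=gen)
        return F.normalize(vec, dim=0)  # unit norm: dot product == cosine


encoder = DummyEncoder()

# Instruction-conditioned query embedding.
query = encoder.embed(
    "Find the video that matches this description:",
    "a cat knocking a glass off a table",
)

# Candidates drawn from mixed visual sources (captions stand in for the
# actual video frames / document pages a real model would encode).
candidates = [
    "home video: cat pushes a cup off the kitchen counter",
    "scanned invoice, page 2 of a purchase order",
    "photo of a mountain lake at sunrise",
]
cand_vecs = torch.stack(
    [encoder.embed("Represent this candidate for retrieval:", c) for c in candidates]
)

scores = cand_vecs @ query  # cosine similarities, one per candidate
best = int(scores.argmax())
print(f"top match: {candidates[best]!r} (score={scores[best].item():.3f})")
```

Because all vectors are unit-normalized, a plain dot product gives cosine similarity, so scoring stays a single matrix multiply even as the candidate pool grows.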
Why does it matter?
This matters because it makes AI better at understanding and working with all kinds of visual information, not just natural photographs. That translates into stronger performance in applications that analyze videos, read documents, or combine text and images, making AI systems more flexible and useful.
Abstract
VLM2Vec-V2 is a unified framework for multimodal embedding that supports diverse visual forms, including videos and visual documents, and improves performance across a wide range of tasks and benchmarks.