M3DR: Towards Universal Multilingual Multimodal Document Retrieval
Adithya S Kolavi, Vyoman Jain
2025-12-08
Summary
This paper introduces a new system called M3DR that improves the ability of computers to search for documents using both images and text, even when those documents are in different languages.
What's the problem?
Current document search systems that use both images and text are mostly designed for English. This means they don't work very well when you need to search for documents in other languages, which limits their usefulness for the many people who search in other languages and for content from diverse linguistic and cultural contexts.
What's the solution?
The researchers created M3DR, a system that uses artificially created multilingual data to train models to understand the connection between images and text across 22 different languages. They used a technique called contrastive training, which pulls matching text-image pairs together and pushes mismatched pairs apart, so the models build a common understanding of text and images regardless of the language. The system works with both major styles of search technology: representing each document with a single summary vector, or comparing individual words in the query against individual parts of the document image.
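To make the contrastive training idea concrete, here is a minimal sketch of a symmetric InfoNCE-style loss over a batch of paired text and image embeddings. The function name, the temperature value, and the use of numpy are illustrative assumptions, not details from the paper; the paper's actual training objective and hyperparameters may differ.

```python
import numpy as np

def info_nce_loss(text_emb, image_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss over paired embeddings.

    Matching text/image pairs share a row index; every other pairing in
    the batch acts as a negative. Temperature 0.07 is a common choice,
    assumed here for illustration.
    """
    # L2-normalize so dot products are cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    v = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    logits = t @ v.T / temperature            # (batch, batch) similarities
    labels = np.arange(len(logits))           # diagonal entries are positives

    def xent(l):
        # cross-entropy of the diagonal (correct) entries, numerically stable
        l = l - l.max(axis=1, keepdims=True)
        logp = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -logp[labels, labels].mean()

    # average of text->image and image->text directions
    return 0.5 * (xent(logits) + xent(logits.T))

rng = np.random.default_rng(0)
text = rng.normal(size=(4, 8))
# a slightly noisy copy stands in for the paired image embeddings
loss = info_nce_loss(text, text + 0.01 * rng.normal(size=(4, 8)))
```

Training with such a loss drives the text encoder and the document-image encoder toward a shared embedding space, which is what lets a query in one language retrieve a document rendered in another.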
Why does it matter?
This work is important because it significantly improves cross-lingual document search, making information accessible to more people worldwide. The new system performs much better than existing methods (roughly a 150% relative improvement on cross-lingual retrieval), finding relevant documents even when the search query and the document are in different languages, and it provides a new benchmark for evaluating these kinds of systems.
Abstract
Multimodal document retrieval systems have shown strong progress in aligning visual and textual content for semantic search. However, most existing approaches remain heavily English-centric, limiting their effectiveness in multilingual contexts. In this work, we present M3DR (Multilingual Multimodal Document Retrieval), a framework designed to bridge this gap across languages, enabling applicability across diverse linguistic and cultural contexts. M3DR leverages synthetic multilingual document data and generalizes across different vision-language architectures and model sizes, enabling robust cross-lingual and cross-modal alignment. Using contrastive training, our models learn unified representations for text and document images that transfer effectively across languages. We validate this capability on 22 typologically diverse languages, demonstrating consistent performance and adaptability across linguistic and script variations. We further introduce a comprehensive benchmark that captures real-world multilingual scenarios, evaluating models under monolingual, multilingual, and mixed-language settings. M3DR generalizes across both single dense vector and ColBERT-style token-level multi-vector retrieval paradigms. Our models, NetraEmbed and ColNetraEmbed, achieve state-of-the-art performance with ~150% relative improvement on cross-lingual retrieval.
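The abstract's two retrieval paradigms can be sketched side by side: single dense vector scoring compares one query vector with one document vector, while ColBERT-style late interaction matches each query token embedding against its best-matching document token (or image patch) embedding and sums the results. The function names and shapes below are illustrative assumptions, not the paper's actual API.

```python
import numpy as np

def dense_score(query_vec, doc_vec):
    """Single dense vector retrieval: one cosine similarity per document."""
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vec / np.linalg.norm(doc_vec)
    return float(q @ d)

def maxsim_score(query_tokens, doc_tokens):
    """ColBERT-style late interaction (MaxSim): each query token picks its
    best-matching document token or image-patch embedding; scores are summed.
    """
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_tokens / np.linalg.norm(doc_tokens, axis=1, keepdims=True)
    sim = q @ d.T                      # (n_query_tokens, n_doc_tokens)
    return float(sim.max(axis=1).sum())

rng = np.random.default_rng(0)
q_tokens = rng.normal(size=(3, 8))     # 3 query token embeddings, dim 8
doc_tokens = rng.normal(size=(5, 8))   # 5 document token/patch embeddings
score = maxsim_score(q_tokens, doc_tokens)
```

Dense scoring is cheap to index and search; MaxSim is more expensive but preserves fine-grained token-to-patch matches, which is why frameworks like M3DR aim to support both.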