
MINIMA: Modality Invariant Image Matching

Xingyu Jiang, Jiangwei Ren, Zizhuo Li, Xin Zhou, Dingkang Liang, Xiang Bai

2025-01-16


Summary

This paper introduces MINIMA, a new way to make computers better at matching images that look different because they were taken with different types of cameras or in different styles. It's like teaching a computer to recognize that a photo and a painting of the same scene are actually showing the same thing.

What's the problem?

Right now, computers struggle to match images that look very different, even if they're of the same thing. This is because current methods are trained on limited types of images and don't work well when faced with new types of images they haven't seen before. It's as if you only taught someone to recognize dogs in photos: they might struggle to recognize dogs in cartoons.

What's the solution?

The researchers created MINIMA, which relies on a simple trick: generate a huge collection of different types of images for training. They start with regular color photos that already come with matching labels and use generative AI to transform them into many different styles, such as infrared-looking images, while keeping those labels. This gives them a large, diverse dataset, which they call MD-syn. By training on this varied dataset, MINIMA learns to match images across many different styles and types, even ones it hasn't specifically seen before.
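To make the data-engine idea concrete, here is a rough sketch. The names below (translate_modality, build_synthetic_pair, and the list of target styles) are made-up placeholders, not the actual MINIMA code; the real system plugs in generative image-to-image models, while this sketch only shows where such a model would sit and why the original matching labels can be reused.

```python
# A minimal sketch of "scaling up modalities" from labeled RGB pairs.
# All names here are hypothetical placeholders, not the released MINIMA code.
import random

MODALITIES = ["infrared", "depth", "sketch", "paint"]  # example target styles


def translate_modality(rgb_image, modality):
    """Placeholder for a generative model that restyles an RGB image
    (e.g., RGB -> infrared-looking). It changes appearance only, not geometry."""
    raise NotImplementedError("plug in an image-to-image generative model here")


def build_synthetic_pair(rgb_pair):
    """Turn one labeled RGB pair into a labeled cross-modal pair.

    rgb_pair = (image_a, image_b, correspondences), where correspondences
    are the ground-truth matches that came with the RGB dataset.
    """
    image_a, image_b, correspondences = rgb_pair
    # Randomly pick which side gets translated, and into which style.
    target = random.choice(MODALITIES)
    if random.random() < 0.5:
        image_a = translate_modality(image_a, target)
    else:
        image_b = translate_modality(image_b, target)
    # Key point: the matching labels are inherited unchanged, because the
    # translation only changes how the image looks, not where things are.
    return image_a, image_b, correspondences
```

The important detail is the last line: since the generative model changes only the appearance of the image, the ground-truth correspondences from the cheap, plentiful RGB data carry over to the new modality pair for free.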

Why it matters?

This matters because it could make computers much better at understanding and working with different types of images. This could help in fields like medical imaging, where matching X-rays to MRIs could be really useful, or in self-driving cars that need to understand both regular camera images and infrared night vision. It could also help in art and design, making it easier for computers to understand the connections between different styles of images. Overall, it's a big step towards making computers see and understand the world more like humans do.

Abstract

Image matching for both cross-view and cross-modality plays a critical role in multimodal perception. In practice, the modality gap caused by different imaging systems/styles poses great challenges to the matching task. Existing works try to extract invariant features for specific modalities and train on limited datasets, showing poor generalization. In this paper, we present MINIMA, a unified image matching framework for multiple cross-modal cases. Without pursuing fancy modules, our MINIMA aims to enhance universal performance from the perspective of data scaling up. For such purpose, we propose a simple yet effective data engine that can freely produce a large dataset containing multiple modalities, rich scenarios, and accurate matching labels. Specifically, we scale up the modalities from cheap but rich RGB-only matching data, by means of generative models. Under this setting, the matching labels and rich diversity of the RGB dataset are well inherited by the generated multimodal data. Benefiting from this, we construct MD-syn, a new comprehensive dataset that fills the data gap for general multimodal image matching. With MD-syn, we can directly train any advanced matching pipeline on randomly selected modality pairs to obtain cross-modal ability. Extensive experiments on in-domain and zero-shot matching tasks, including 19 cross-modal cases, demonstrate that our MINIMA can significantly outperform the baselines and even surpass modality-specific methods. The dataset and code are available at https://github.com/LSXI7/MINIMA .
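To illustrate the phrase "directly train any advanced matching pipeline on randomly selected modality pairs," here is a minimal, hypothetical training loop, assuming PyTorch. The names matcher, md_syn, and matching_loss are stand-ins for whatever matching model, synthetic dataset, and loss a reader would plug in; this is not the released training code.

```python
# Hedged sketch: training an arbitrary matcher on shuffled cross-modal pairs.
import random
import torch


def train_epoch(matcher: torch.nn.Module,
                md_syn: list,                    # list of (img_a, img_b, gt_matches)
                optimizer: torch.optim.Optimizer,
                matching_loss) -> float:
    """One pass over synthetic cross-modal pairs in random order."""
    total = 0.0
    random.shuffle(md_syn)                       # random modality pairs each epoch
    for img_a, img_b, gt_matches in md_syn:
        pred = matcher(img_a, img_b)             # any matching pipeline can sit here
        loss = matching_loss(pred, gt_matches)   # supervised by inherited labels
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        total += loss.item()
    return total / max(len(md_syn), 1)
```

The design choice the abstract emphasizes is that nothing in this loop is modality-specific: the cross-modal ability comes from the diversity of the sampled pairs, not from special-purpose modules.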