MiCo: Multi-image Contrast for Reinforcement Visual Reasoning

Xi Chen, Mingkang Zhu, Shaoteng Liu, Xiaoyang Wu, Xiaogang Xu, Yu Liu, Xiang Bai, Hengshuang Zhao

2025-06-30

Summary

This paper introduces MiCo, a method that improves how vision-language models reason about problems involving multiple images by having them teach themselves from sets of three related pictures.

What's the problem?

Vision-language models usually need lots of questions and answers that humans create to learn how to reason with multiple images, but making these labeled examples is slow and expensive.

What's the solution?

MiCo addresses this with self-supervised learning: the model compares groups of three related images and learns from those comparisons without any human-made labels, which improves its ability to understand and reason across several pictures.
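
The summary does not say how the image triplets are built or how comparisons are scored. Purely as an illustrative sketch, assume a triplet consists of two random crops of one image plus one crop of a different image; under that assumption the position of the "odd one out" is known for free, with no human annotation. The function names `random_crop` and `make_triplet` are hypothetical, and images are plain lists of lists to keep the example dependency-free.

```python
import random


def random_crop(image, crop_h, crop_w, rng):
    """Return a random crop_h x crop_w window of a 2D list-of-lists image."""
    h, w = len(image), len(image[0])
    top = rng.randint(0, h - crop_h)    # inclusive bounds
    left = rng.randint(0, w - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def make_triplet(anchor_image, other_image, crop_h, crop_w, seed=0):
    """Build a shuffled triplet and the index of the crop from the other image.

    Assumption (not from the source): two views come from `anchor_image`
    and one from `other_image`, so the label is derived automatically.
    """
    rng = random.Random(seed)
    views = [
        ("same", random_crop(anchor_image, crop_h, crop_w, rng)),
        ("same", random_crop(anchor_image, crop_h, crop_w, rng)),
        ("different", random_crop(other_image, crop_h, crop_w, rng)),
    ]
    rng.shuffle(views)  # hide the odd image's position
    images = [view for _, view in views]
    odd_index = [source for source, _ in views].index("different")
    return images, odd_index


if __name__ == "__main__":
    anchor = [[0] * 8 for _ in range(8)]  # stand-in for one picture
    other = [[1] * 8 for _ in range(8)]   # stand-in for a different picture
    images, odd_index = make_triplet(anchor, other, 4, 4)
    print("odd image is at position", odd_index)
```

A model's answer to "which image differs?" could then be checked against `odd_index` automatically, which is the sense in which no human-made labels are needed.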

Why does it matter?

This matters because it lets vision-language models get better at understanding complex visual information spread across multiple images with much less effort and cost, making them easier to use in areas like visual storytelling, robotics, and data analysis.

Abstract

Self-supervised learning using image triplets enhances the reasoning ability of Vision-Language Models (VLMs) on multi-image tasks without the need for human-annotated question-answer pairs.
