
Don't Look Only Once: Towards Multimodal Interactive Reasoning with Selective Visual Revisitation

Jiwan Chung, Junhyeok Kim, Siyeol Kim, Jaeyoung Lee, Min Soo Kim, Youngjae Yu

2025-06-02

Summary

This paper talks about a new approach called v1 that helps AI models do a better job at understanding and reasoning about both images and text by letting them go back and look at specific parts of an image more than once while answering questions.

What's the problem?

Most multimodal AI models, which handle both pictures and words, look at an image only once and then try to remember everything from that single pass. This can make them miss important details or make mistakes when reasoning about what they see.

What's the solution?

The researchers improved these models by allowing them to revisit and focus on different regions of an image as needed, instead of just relying on their first impression. This selective and dynamic revisiting helps the AI gather more accurate information and reason better when dealing with complex tasks that involve both visuals and language.
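A rough way to picture the mechanism, in PyTorch-style code: during decoding, the model's current hidden state is compared against a cache of image patch embeddings, and the best-matching patches are copied back into the context so the model can effectively look at those regions again. This is only a minimal sketch; the class name, the top-k selection rule, and all shapes are illustrative assumptions, not the authors' implementation.

import torch

class VisualRevisitor(torch.nn.Module):
    # Hypothetical module: selects image patches to revisit at a decoding step.
    def __init__(self, d_model, k=4):
        super().__init__()
        self.query_proj = torch.nn.Linear(d_model, d_model)  # decoder state -> visual query
        self.key_proj = torch.nn.Linear(d_model, d_model)    # cached patches -> keys
        self.k = k                                            # number of patches to revisit

    def forward(self, hidden, patch_cache):
        # hidden: (batch, d_model) current decoder state
        # patch_cache: (batch, num_patches, d_model) image patch embeddings kept from the encoder
        q = self.query_proj(hidden).unsqueeze(1)              # (batch, 1, d_model)
        keys = self.key_proj(patch_cache)                     # (batch, num_patches, d_model)
        scores = (q * keys).sum(-1) / keys.shape[-1] ** 0.5   # scaled similarity per patch
        top = scores.topk(self.k, dim=-1).indices             # indices of patches to revisit
        idx = top.unsqueeze(-1).expand(-1, -1, patch_cache.shape[-1])
        # Copy the selected patch embeddings so they can be re-inserted into the context.
        return patch_cache.gather(1, idx)                     # (batch, k, d_model)

# Toy usage: one decoding step over a 2-image batch with 196 cached patches each.
revisit = VisualRevisitor(d_model=64, k=4)
hidden = torch.randn(2, 64)
patches = torch.randn(2, 196, 64)
print(revisit(hidden, patches).shape)  # torch.Size([2, 4, 64])

The key design idea this sketch tries to convey is that the image is not discarded after the first pass: its patch embeddings stay available, and the model decides at each step whether and where to look again.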

Why it matters?

This is important because it means AI models can handle more complicated questions and tasks that require careful attention to both images and text, making them more helpful for things like education, research, and real-world problem solving.

Abstract

v1 enhances Multimodal Large Language Models by enabling selective and dynamic visual region retrieval during inference, improving performance on multimodal reasoning tasks.