IF-Bench: Benchmarking and Enhancing MLLMs for Infrared Images with Generative Visual Prompting
Tao Zhang, Yuyang Hong, Yang Xia, Kun Ding, Zeyu Zhang, Ying Wang, Shiming Xiang, Chunhong Pan
2025-12-11
Summary
This paper introduces a new way to test how well artificial intelligence models understand infrared images, which capture heat rather than visible light. It also proposes a method to help these models perform better on infrared images.
What's the problem?
Current AI models, specifically multimodal large language models, are very good at understanding images and text together, but their ability to understand infrared images has gone largely untested. This is a problem because infrared images are used in important fields like medical imaging, security, and search and rescue, so we need AI that can interpret them accurately. There was no standard way to measure this ability, making it hard to compare different AI models.
What's the solution?
The researchers created a benchmark called IF-Bench, a collection of 499 infrared images drawn from 23 datasets, paired with 680 carefully curated questions and answers about those images. They used this benchmark to test over 40 different AI models, applying cyclic evaluation, bilingual assessment, and hybrid judgment strategies to make the results reliable. They also developed a technique called GenViP that translates infrared images into regular color images, making them easier for the AI to understand without needing to be specifically trained on infrared data.
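The core idea of GenViP can be sketched in a few lines: translate the infrared image into an RGB image first, then hand the RGB result to an unmodified, frozen model. The sketch below is illustrative only; the paper uses an advanced image editing model for the translation, while here a simple heat-style colormap stands in for it, and the function names are hypothetical, not the paper's API.

```python
# Illustrative sketch of the GenViP pipeline (training-free visual prompting).
# A cheap colormap stands in for the paper's image editing model; the point is
# only the structure: IR -> RGB translation, then a question to a frozen MLLM.

def ir_to_rgb(ir_image):
    """Map a 2D grid of 0-255 infrared intensities to RGB triples.

    Stand-in translator: cold pixels map toward blue, hot pixels toward
    red/yellow (a crude 'heat' colormap).
    """
    rgb = []
    for row in ir_image:
        rgb_row = []
        for v in row:
            r = min(255, 2 * v)
            g = max(0, 2 * v - 255)
            b = max(0, 255 - 2 * v)
            rgb_row.append((r, g, b))
        rgb.append(rgb_row)
    return rgb


def answer_question(mllm, ir_image, question):
    """The MLLM itself is untouched; it only ever sees the RGB translation."""
    rgb_image = ir_to_rgb(ir_image)      # domain-translation step
    return mllm(rgb_image, question)     # frozen, off-the-shelf model


if __name__ == "__main__":
    # Trivial stand-in "MLLM" just reports what it received.
    fake_mllm = lambda img, q: f"answered {q!r} on a {len(img)}x{len(img[0])} RGB image"
    ir = [[0, 128, 255], [64, 192, 32]]
    print(answer_question(fake_mllm, ir, "How many people are visible?"))
```

Because the translation happens entirely at the input side, any existing MLLM can be dropped in as `mllm` without retraining, which is what makes the approach training-free.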
Why it matters?
This work matters because it provides a standard way to evaluate AI's ability to understand infrared images. That will help researchers improve these models and make them more useful in real-world applications where infrared imaging is crucial. The GenViP method offers a practical way to boost performance without extensive retraining, which saves time and resources.
Abstract
Recent advances in multimodal large language models (MLLMs) have led to impressive progress across various benchmarks. However, their capability in understanding infrared images remains unexplored. To address this gap, we introduce IF-Bench, the first high-quality benchmark designed for evaluating multimodal understanding of infrared images. IF-Bench consists of 499 images sourced from 23 infrared datasets and 680 carefully curated visual question-answer pairs, covering 10 essential dimensions of image understanding. Based on this benchmark, we systematically evaluate over 40 open-source and closed-source MLLMs, employing cyclic evaluation, bilingual assessment, and hybrid judgment strategies to enhance the reliability of the results. Our analysis reveals how model scale, architecture, and inference paradigms affect infrared image comprehension, providing valuable insights for this area. Furthermore, we propose a training-free generative visual prompting (GenViP) method, which leverages advanced image editing models to translate infrared images into semantically and spatially aligned RGB counterparts, thereby mitigating domain distribution shifts. Extensive experiments demonstrate that our method consistently yields significant performance improvements across a wide range of MLLMs. The benchmark and code are available at https://github.com/casiatao/IF-Bench.