
Demystifying the Visual Quality Paradox in Multimodal Large Language Models

Shuo Xing, Lanqing Guo, Hongyuan Hua, Seoyoung Lee, Peiran Li, Yufei Wang, Zhangyang Wang, Zhengzhong Tu

2025-06-24


Summary

This paper examines a surprising finding called the visual-quality paradox: multimodal large language models (MLLMs) sometimes perform better on vision-language tasks when the input images are degraded rather than perfectly clear and high quality.

What's the problem?

It was commonly assumed that clearer, cleaner images would always help models understand a scene better. This study shows that higher image quality does not reliably improve model performance and can even make some tasks harder for the models.

What's the solution?

The researchers developed Visual-Quality Test-Time Tuning (VQ-TTT), a method that adjusts each input image at test time so it better matches the model's learned preferences. VQ-TTT uses a small learnable layer to modulate the frequency content of the image and lightly fine-tunes only a small part of the vision encoder, improving results without extra training data or large changes to the model.
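To make the idea concrete, here is a minimal PyTorch-style sketch of this kind of test-time tuning. It is not the authors' implementation: the `FrequencyModulator` class, the `vq_ttt_step` helper, and the `vision_encoder` / `llm_head` callables are illustrative assumptions about how a learnable frequency-domain layer could be adapted per input while most of the model stays frozen.

```python
# Minimal sketch of a frequency-modulating layer tuned at test time.
# Names and hyperparameters here are illustrative assumptions, not the
# paper's exact implementation.
import torch
import torch.nn as nn


class FrequencyModulator(nn.Module):
    """Small learnable layer that rescales an image's frequency content."""

    def __init__(self, height, width):
        super().__init__()
        # One learnable gain per frequency bin, initialized to identity (all ones).
        self.gain = nn.Parameter(torch.ones(height, width // 2 + 1))

    def forward(self, images):  # images: (B, C, H, W)
        spectrum = torch.fft.rfft2(images, norm="ortho")  # to frequency domain
        spectrum = spectrum * self.gain                   # reweight frequencies
        return torch.fft.irfft2(spectrum, s=images.shape[-2:], norm="ortho")


def vq_ttt_step(modulator, vision_encoder, llm_head, images, text_inputs, optimizer):
    """One test-time tuning step: adapt the modulator so the adjusted image
    better matches the (mostly frozen) model's preferences.

    `vision_encoder` and `llm_head` are stand-ins for the MLLM's components;
    `llm_head` is assumed to return a scalar loss (e.g. the model's own LM loss).
    """
    adjusted = modulator(images)                  # adjust image quality, not the task
    visual_tokens = vision_encoder(adjusted)      # encoder is largely frozen
    loss = llm_head(visual_tokens, text_inputs)   # self-supervised objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In a setup like this, the optimizer would typically be built over `modulator.parameters()` (plus at most a small subset of encoder parameters), so the adaptation stays cheap and the bulk of the MLLM remains untouched.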

Why it matters?

This matters because it challenges the assumption that higher image quality is always better and shows that adapting inputs to a model's preferences can improve performance. It suggests that future AI systems should use adaptive image processing, rather than a fixed "always clean" pipeline, to interpret visual information and complete tasks more accurately.

Abstract

Visual-Quality Test-Time Tuning (VQ-TTT) improves the performance of Multimodal Large Language Models (MLLMs) on vision-language tasks by dynamically adjusting input images, demonstrating the importance of adaptive, rather than universally clean, input imagery.