Robust-R1: Degradation-Aware Reasoning for Robust Visual Understanding

Jiaqi Tang, Jianmin Chen, Wei Wei, Xiaogang Xu, Runtao Liu, Xiangyu Wu, Qipeng Xie, Jiafei Wu, Lei Zhang, Qifeng Chen

2025-12-22

Summary

This paper examines how well AI models that 'see' and 'understand' images, called Multimodal Large Language Models (MLLMs), perform when images are of poor quality or distorted in ways that commonly occur in the real world.

What's the problem?

Current AI models struggle when faced with real-world image issues like blur, noise, poor lighting, or other distortions. Existing attempts to make these models more robust usually focus on improving how the model initially processes the image (the visual encoder), but they don't explain *why* the model makes certain decisions or how it handles the degradation. This makes it hard to improve the model in a focused way or to understand its limitations.

What's the solution?

The researchers developed a new framework called Robust-R1 that specifically teaches the model to *reason* about image degradations. It does this in three main steps: first, it's trained to understand how different types of degradation affect images; second, it's given feedback to accurately identify the specific parameters of the degradation (like how blurry or noisy an image is); and third, it adjusts how deeply it analyzes the image based on how severe the degradation is. To help with this, they also created a new dataset of 11,000 images with realistic distortions and detailed explanations of how those distortions impact the image's meaning.
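To make the second and third steps more concrete, here is a minimal Python sketch of the general idea: a toy reward that scores how accurately the model estimates a degradation parameter, and a rule that scales reasoning depth with estimated severity. All names, thresholds, and the linear mapping are illustrative assumptions, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' code) of two Robust-R1 ideas:
# rewarding accurate degradation-parameter estimates, and scaling
# reasoning depth with estimated severity. All values are assumptions.

from dataclasses import dataclass


@dataclass
class DegradationEstimate:
    """Hypothetical output of the model's degradation-perception step."""
    kind: str        # e.g. "blur", "noise", "low_light"
    severity: float  # normalized to [0, 1]; 0 = pristine, 1 = extreme


def parameter_reward(predicted_severity: float, true_severity: float) -> float:
    """Toy reward for the alignment step: closer severity predictions
    earn higher reward (1.0 at an exact match, 0.0 at worst)."""
    return 1.0 - min(1.0, abs(predicted_severity - true_severity))


def reasoning_depth(estimate: DegradationEstimate,
                    min_steps: int = 2,
                    max_steps: int = 8) -> int:
    """Map estimated severity to a number of reasoning steps: cleaner
    images get short chains, heavily degraded images get longer ones.
    The linear mapping is an illustrative assumption."""
    severity = max(0.0, min(1.0, estimate.severity))
    return min_steps + round(severity * (max_steps - min_steps))


if __name__ == "__main__":
    est = DegradationEstimate(kind="blur", severity=0.7)
    print(parameter_reward(0.7, 0.65))  # 0.95 -> close estimate, high reward
    print(reasoning_depth(est))         # 6 -> deeper analysis for a blurry image
```

The real framework learns these behaviors through supervised fine-tuning and reward-driven alignment rather than fixed rules; the sketch only illustrates the intuition of tying analysis effort to how degraded the input appears.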

Why it matters?

This work is important because it makes AI vision systems more reliable in real-world situations where images aren't always perfect. By explicitly teaching the model to understand and reason about image quality, Robust-R1 outperforms existing methods and offers a more interpretable and adaptable approach to building robust AI.

Abstract

Multimodal Large Language Models struggle to maintain reliable performance under extreme real-world visual degradations, which impede their practical robustness. Existing robust MLLMs predominantly rely on implicit training/adaptation that focuses solely on visual encoder generalization, suffering from limited interpretability and isolated optimization. To overcome these limitations, we propose Robust-R1, a novel framework that explicitly models visual degradations through structured reasoning chains. Our approach integrates: (i) supervised fine-tuning for degradation-aware reasoning foundations, (ii) reward-driven alignment for accurately perceiving degradation parameters, and (iii) dynamic reasoning depth scaling adapted to degradation intensity. To facilitate this approach, we introduce a specialized 11K dataset featuring realistic degradations synthesized across four critical real-world visual processing stages, each annotated with structured chains connecting degradation parameters, perceptual influence, pristine semantic reasoning chain, and conclusion. Comprehensive evaluations demonstrate state-of-the-art robustness: Robust-R1 outperforms all general and robust baselines on the real-world degradation benchmark R-Bench, while maintaining superior anti-degradation performance under multi-intensity adversarial degradations on MMMB, MMStar, and RealWorldQA.