Can Textual Reasoning Improve the Performance of MLLMs on Fine-grained Visual Classification?
Jie Zhu, Yiyang Su, Xiaoming Liu
2026-01-13
Summary
This paper investigates why prompting large AI models to 'think step-by-step' actually makes them *worse* at detailed image recognition, a task they already find difficult. The authors then propose a new method that lets these models reason without sacrificing accuracy on visual tasks.
What's the problem?
Large language models that can understand both text and images are good at many things, but they often have trouble with Fine-Grained Visual Classification, which means telling apart very similar things in pictures – like different species of birds. A common technique to improve AI performance, called Chain-of-Thought reasoning (where the AI explains its thinking), surprisingly *decreases* accuracy in these visual tasks. Previous research hasn't fully explained why this happens, just that it does.
What's the solution?
The researchers found that the problem isn't reasoning itself, but *how much* the model reasons: longer explanations consistently lead to more mistakes, a phenomenon they call the 'Cost of Thinking'. Their system, ReFine-RFT, tackles this in two ways. First, it balances the different types of reward feedback the model receives during training, using a new plug-and-play normalization method (called \alg in the paper). Second, it limits the length of the model's reasoning process while still providing dense, accuracy-oriented feedback.
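To make the two ideas concrete, here is a minimal sketch of multi-reward balancing with a length penalty. The paper does not spell out its reward definitions or the \alg normalization here, so the z-score normalization, the reward names (`accuracy`, `fmt`), the length budget, and the mixing weights below are all illustrative assumptions, not the authors' actual method.

```python
# Hypothetical sketch: balance heterogeneous rewards and penalize long reasoning.
# The normalization scheme and all constants are illustrative assumptions;
# the paper's actual method (\alg / ReFine-RFT) may differ.
from typing import List

def normalize(values: List[float]) -> List[float]:
    """Z-score normalize one reward signal across a batch of rollouts,
    so heterogeneous rewards contribute on a comparable scale."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    std = var ** 0.5
    if std == 0:
        return [0.0 for _ in values]  # constant reward carries no signal
    return [(v - mean) / std for v in values]

def combined_reward(accuracy: List[float], fmt: List[float],
                    lengths: List[int], max_len: int = 128) -> List[float]:
    """Combine normalized accuracy and format rewards with a penalty on
    reasoning traces longer than max_len tokens (the 'Cost of Thinking')."""
    acc_n = normalize(accuracy)
    fmt_n = normalize(fmt)
    # Linear penalty that kicks in once reasoning exceeds the length budget.
    len_pen = [max(0.0, (l - max_len) / max_len) for l in lengths]
    return [a + 0.5 * f - p for a, f, p in zip(acc_n, fmt_n, len_pen)]
```

Under this sketch, a rollout that answers correctly with a short trace outscores one that answers incorrectly after a long trace, which is the qualitative behavior the training objective is meant to encourage.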
Why does it matter?
This research is important because it identifies a key flaw in applying reasoning techniques to visual AI. By understanding why 'thinking' can hurt performance, and by creating a method to control it, we can build more reliable and accurate AI systems for real-world applications that require precise image understanding, like medical diagnosis or environmental monitoring.
Abstract
Multi-modal large language models (MLLMs) exhibit strong general-purpose capabilities, yet still struggle on Fine-Grained Visual Classification (FGVC), a core perception task that requires subtle visual discrimination and is crucial for many real-world applications. A widely adopted strategy for boosting performance on challenging tasks such as math and coding is Chain-of-Thought (CoT) reasoning. However, several prior works have reported that CoT can actually harm performance on visual perception tasks. These studies, though, examine the issue from relatively narrow angles and leave open why CoT degrades perception-heavy performance. We systematically re-examine the role of CoT in FGVC through the lenses of zero-shot evaluation and multiple training paradigms. Across these settings, we uncover a central paradox: the degradation induced by CoT is largely driven by reasoning length, with longer textual reasoning consistently lowering classification accuracy. We term this phenomenon the "Cost of Thinking". Building on this finding, we make two key contributions: (1) \alg, a simple and general plug-and-play normalization method for multi-reward optimization that balances heterogeneous reward signals, and (2) ReFine-RFT, a framework that combines ensemble rewards with \alg to constrain reasoning length while providing dense accuracy-oriented feedback. Extensive experiments demonstrate the effectiveness of our findings and the proposed ReFine-RFT, achieving state-of-the-art performance across FGVC benchmarks. Code and models are available at https://github.com/jiezhu23/ReFine-RFT.