CLS-RL: Image Classification with Rule-Based Reinforcement Learning
Ming Li, Shitian Zhao, Jike Zhong, Yuxiang Lai, Kaipeng Zhang
2025-03-21
Summary
This paper is about improving how AI models classify images, especially when only a small amount of labeled training data is available.
What's the problem?
Training AI to classify images usually requires a lot of labeled data, which is expensive and time-consuming to collect. Moreover, simply fine-tuning models on a small amount of data can cause overfitting and lead to poor performance.
What's the solution?
The researchers developed a method called CLS-RL that uses reinforcement learning: the model is rewarded according to simple, verifiable rules whenever it classifies an image correctly. This helps the model learn to classify images more effectively from less data.
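A minimal sketch of what such a rule-based reward could look like (the <think>/<answer> tag template, function names, and scoring here are illustrative assumptions, not the paper's exact implementation):

    import re

    def format_reward(response: str) -> float:
        # 1.0 if the response follows the expected template: reasoning
        # inside <think> tags, then the label inside <answer> tags.
        # (Assumed template, common in rule-based RL fine-tuning.)
        pattern = r"<think>.*?</think>\s*<answer>.*?</answer>"
        return 1.0 if re.fullmatch(pattern, response.strip(), re.DOTALL) else 0.0

    def accuracy_reward(response: str, label: str) -> float:
        # 1.0 if the ground-truth class name appears in the answer span.
        match = re.search(r"<answer>(.*?)</answer>", response, re.DOTALL)
        if match is None:
            return 0.0
        return 1.0 if label.lower() in match.group(1).lower() else 0.0

    def cls_rl_reward(response: str, label: str) -> float:
        # Total verifiable reward: format compliance plus correctness.
        return format_reward(response) + accuracy_reward(response, label)

Because both rewards are computed by simple string rules against a known label, no learned reward model or extra annotation is needed.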
Why it matters?
This work matters because it can make AI image classification more accessible and efficient, especially in situations where labeled data is scarce.
Abstract
Classification is a core task in machine learning. Recent research has shown that although Multimodal Large Language Models (MLLMs) are initially poor at image classification, fine-tuning them with an adequate amount of data can significantly enhance their performance, making them comparable to SOTA classification models. However, acquiring large-scale labeled data is expensive. In this paper, we explore few-shot MLLM classification fine-tuning. We found that supervised fine-tuning (SFT) can cause severe overfitting and may even degrade performance relative to the zero-shot approach. To address this challenge, inspired by recent successes in rule-based reinforcement learning, we propose CLS-RL, which uses verifiable signals as rewards to fine-tune MLLMs. We discovered that CLS-RL outperforms SFT on most datasets and has much higher average accuracy in both the base-to-new and few-shot learning settings. Moreover, we observed a free-lunch phenomenon for CLS-RL: when models are fine-tuned on a particular dataset, their performance on other, distinct datasets may also improve over zero-shot models, even if those datasets differ in distribution and class names. This suggests that RL-based methods effectively teach models the fundamentals of classification. Lastly, inspired by recent work on inference-time thinking, we re-examine the 'thinking process' during fine-tuning, a critical aspect of RL-based methods, in the context of visual classification. We question whether such tasks require an extensive thinking process during fine-tuning and propose that it may actually detract from performance. Based on this premise, we introduce the No-Thinking-CLS-RL method, which minimizes the thinking process during training by setting an equality accuracy reward. Our findings indicate that, with much less fine-tuning time, the No-Thinking-CLS-RL method achieves better in-domain performance and generalization than CLS-RL.
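The abstract names the "equality accuracy reward" without specifying it; one plausible reading, sketched here as an assumption, is a reward that pays off only when the model's entire output exactly equals the class label, so any additional reasoning text forfeits the reward:

    def equality_accuracy_reward(response: str, label: str) -> float:
        # Assumed interpretation: 1.0 only when the whole output, after
        # normalizing case and whitespace, is exactly the class label.
        # Any extra "thinking" text makes the comparison fail, so the
        # policy is pushed toward direct answers.
        return 1.0 if response.strip().lower() == label.strip().lower() else 0.0

Under this reading, the shorter outputs would also account for the reduced fine-tuning time the authors report, since fewer tokens are generated per rollout.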