Improve Vision Language Model Chain-of-thought Reasoning
Ruohong Zhang, Bowen Zhang, Yanghao Li, Haotian Zhang, Zhiqing Sun, Zhe Gan, Yinfei Yang, Ruoming Pang, Yiming Yang
2024-10-23

Summary
This paper discusses how to improve the reasoning abilities of vision-language models (VLMs) by focusing on chain-of-thought (CoT) reasoning, which helps these models provide clearer and more trustworthy answers.
What's the problem?
Current training methods for VLMs often rely on short answers that don't provide enough detail for complex reasoning tasks. This can lead to poor performance when the model needs to explain its thought process or make decisions based on visual information.
What's the solution?
The researchers propose a two-part approach to enhance CoT reasoning. First, they use GPT-4o to generate detailed explanations (rationales) that enrich the training data, and fine-tune the VLM on these richer answers. Second, they apply a reinforcement-learning step: the model's own reasoning chains are labeled as correct or incorrect by checking their final answers against the annotated short answers, and these chosen/rejected pairs are used with Direct Preference Optimization (DPO) to refine the model. This helps the model learn from its mistakes and steadily improve its reasoning, as sketched below.
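To make the pairing step concrete, here is a minimal Python sketch of how correct and incorrect reasoning chains could be turned into preference pairs by matching each chain's final answer against the annotated short answer. The helper names (extract_answer, build_preference_pairs) and the "Answer: ..." ending convention are illustrative assumptions, not details taken from the paper.

import re

def extract_answer(chain: str) -> str:
    """Pull the final short answer out of a reasoning chain.
    Assumes (hypothetically) that the chain ends with a line like 'Answer: B'."""
    match = re.search(r"Answer:\s*(.+?)\s*$", chain.strip(), flags=re.IGNORECASE)
    return match.group(1).strip().lower() if match else chain.strip().lower()

def build_preference_pairs(question: str, gold_answer: str, sampled_chains: list[str]) -> list[dict]:
    """Label each sampled chain as correct or incorrect by comparing its extracted
    answer with the annotated short answer, then pair one chosen chain with one rejected chain."""
    correct = [c for c in sampled_chains if extract_answer(c) == gold_answer.lower()]
    incorrect = [c for c in sampled_chains if extract_answer(c) != gold_answer.lower()]
    return [
        {"prompt": question, "chosen": chosen, "rejected": rejected}
        for chosen, rejected in zip(correct, incorrect)
    ]

# Toy usage with two sampled chains for the same question:
chains = [
    "The chart shows sales rising each quarter. Answer: increasing",
    "The bars look roughly flat across quarters. Answer: flat",
]
pairs = build_preference_pairs("What is the sales trend?", "increasing", chains)
print(pairs[0]["chosen"], "||", pairs[0]["rejected"])

The resulting chosen/rejected pairs are the kind of pairwise data a DPO-style fine-tuning step consumes.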
Why it matters?
Improving CoT reasoning in VLMs is important because it makes these models more reliable and easier to interpret. Stronger reasoning can translate into more accurate answers in applications such as visual question answering and multimodal assistants, which are increasingly part of everyday technology.
Abstract
Chain-of-thought (CoT) reasoning in vision language models (VLMs) is crucial for improving interpretability and trustworthiness. However, current training recipes lack robust CoT reasoning data, relying on datasets dominated by short annotations with minimal rationales. In this work, we show that training VLMs on short answers does not generalize well to reasoning tasks that require more detailed responses. To address this, we propose a two-fold approach. First, we distill rationales from the GPT-4o model to enrich the training data and fine-tune VLMs, boosting their CoT performance. Second, we apply reinforcement learning to further calibrate reasoning quality. Specifically, we construct positive (correct) and negative (incorrect) pairs of model-generated reasoning chains, by comparing their predictions with annotated short answers. Using this pairwise data, we apply the Direct Preference Optimization algorithm to refine the model's reasoning abilities. Our experiments demonstrate significant improvements in CoT reasoning on benchmark datasets and better generalization to direct answer prediction as well. This work emphasizes the importance of incorporating detailed rationales in training and leveraging reinforcement learning to strengthen the reasoning capabilities of VLMs.
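For reference, the Direct Preference Optimization step named in the abstract typically minimizes the standard DPO objective (Rafailov et al., 2023); the abstract does not give the exact setup, so the generic form is shown here:

\mathcal{L}_{\text{DPO}}(\pi_\theta; \pi_{\text{ref}}) = -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}} \left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\text{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\text{ref}}(y_l \mid x)} \right) \right]

where x is the image-plus-question prompt, y_w is the chosen (correct) reasoning chain, y_l is the rejected (incorrect) one, \pi_{\text{ref}} is the reference model (typically the supervised fine-tuned VLM), and \beta controls how far the optimized policy may drift from the reference.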