
Diving into Self-Evolving Training for Multimodal Reasoning

Wei Liu, Junlong Li, Xiwen Zhang, Fan Zhou, Yu Cheng, Junxian He

2024-12-24


Summary

This paper introduces MSTaR, a framework for training large multimodal models (LMMs) to improve their reasoning abilities by learning from their own outputs, without requiring large amounts of human-annotated data.

What's the problem?

Training models for complex reasoning tasks typically requires large amounts of annotated data, such as chain-of-thought examples, which are scarce and expensive to obtain. As a result, models struggle to learn how to reason well across different types of information, such as text and images, which limits their performance.

What's the solution?

The authors propose self-evolving training, in which a model improves by learning from its own responses. They identify three key factors that shape this training: the training method, the reward model that signals which responses are good or bad, and the variation in the prompts given to the model. By systematically analyzing these factors, they develop MSTaR, a framework that optimizes the training process and helps models balance exploration (trying new reasoning paths) and exploitation (reusing what they already know). Tested on a range of benchmarks, MSTaR delivers significant improvements in reasoning ability without any additional human annotation.
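The core loop described above can be sketched in a few lines: the model generates several candidate answers per prompt, a reward model scores them, and the highest-scoring responses become the next round's training data. This is a minimal, hypothetical illustration of the general self-evolving idea, not the paper's actual implementation; all function names and the toy model/reward below are invented for the sketch.

```python
import random

def generate_candidates(model, prompt, k=4, temperature=1.0):
    """Sample k candidate responses; temperature would control exploration."""
    return [model(prompt, temperature) for _ in range(k)]

def self_evolve_round(model, reward_model, prompts, k=4, temperature=1.0):
    """One round: keep the best-scoring response per prompt as new training data."""
    new_data = []
    for prompt in prompts:
        candidates = generate_candidates(model, prompt, k, temperature)
        best = max(candidates, key=lambda r: reward_model(prompt, r))
        new_data.append((prompt, best))  # exploitation: train on the best response
    return new_data

# Toy stand-ins so the loop runs end to end (purely illustrative).
toy_model = lambda prompt, temperature: f"{prompt}-ans{random.randint(0, 9)}"
toy_reward = lambda prompt, response: len(response)

data = self_evolve_round(toy_model, toy_reward, ["q1", "q2"], k=3)
```

In a real system the selected `(prompt, best)` pairs would be used to fine-tune the model before the next round, and the sampling temperature and prompt pool would be varied to manage the exploration/exploitation trade-off the paper analyzes.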

Why it matters?

This research is important because it provides a way for AI models to become better at reasoning tasks in a more efficient manner. By using self-evolving training, MSTaR can help advance the capabilities of AI in understanding and processing complex information, making it useful for applications in areas like education, healthcare, and automated decision-making.

Abstract

Reasoning ability is essential for Large Multimodal Models (LMMs). In the absence of multimodal chain-of-thought annotated data, self-evolving training, where the model learns from its own outputs, has emerged as an effective and scalable approach for enhancing reasoning abilities. Despite its growing usage, a comprehensive understanding of self-evolving training, particularly in the context of multimodal reasoning, remains limited. In this paper, we delve into the intricacies of self-evolving training for multimodal reasoning, pinpointing three key factors: Training Method, Reward Model, and Prompt Variation. We systematically examine each factor and explore how various configurations affect the training's effectiveness. Our analysis leads to a set of best practices for each factor, aimed at optimizing multimodal reasoning. Furthermore, we explore the Self-Evolution Dynamics during training and the impact of automatic balancing mechanisms in boosting performance. After all the investigations, we present a final recipe for self-evolving training in multimodal reasoning, encapsulating these design choices into a framework we call MSTaR (Multimodal Self-evolving Training for Reasoning), which is universally effective for models with different sizes on various benchmarks, e.g., surpassing the pre-evolved model significantly on 5 multimodal reasoning benchmarks without using additional human annotations, as demonstrated on MiniCPM-V-2.5 (8B), Phi-3.5-Vision (4B) and InternVL2 (2B). We believe this study fills a significant gap in the understanding of self-evolving training for multimodal reasoning and offers a robust framework for future research. Our policy and reward models, as well as the collected data, are released to facilitate further investigation in multimodal reasoning.