MAmmoTH-VL: Eliciting Multimodal Reasoning with Instruction Tuning at Scale
Jarvis Guo, Tuney Zheng, Yuelin Bai, Bo Li, Yubo Wang, King Zhu, Yizhi Li, Graham Neubig, Wenhu Chen, Xiang Yue
2024-12-09

Summary
This paper introduces MAmmoTH-VL, a method for improving multimodal large language models (MLLMs) by constructing a large-scale instruction-tuning dataset that helps these models reason better when processing both text and images.
What's the problem?
Current datasets used to train multimodal models are often repurposed from academic sources such as VQA, AI2D, and ChartQA and focus on simple tasks. They typically provide only short, phrase-level answers without explaining the reasoning behind them, which limits the models' ability to handle complex questions and scenarios that require multi-step reasoning.
What's the solution?
Using only open-source models, the authors built a dataset of 12 million instruction-response pairs that include detailed explanations, or 'intermediate rationales,' to support better reasoning. The dataset is designed to teach models to think through problems step by step, a technique known as Chain-of-Thought (CoT) reasoning, and its construction relies on rewriting existing answers and self-filtering the results. The authors trained a model on this data and found that it significantly improved performance on reasoning benchmarks compared to previous open models.
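As a rough illustration of what such a rationale-bearing training example might look like (a sketch only; the field names and structure below are assumptions, not the paper's actual data schema):

```python
from dataclasses import dataclass
from typing import List

@dataclass
class CoTExample:
    """One multimodal instruction-response pair with an intermediate rationale.

    Field names are illustrative, not the paper's actual schema.
    """
    image_path: str       # path or URL to the associated image
    instruction: str      # the question or task posed to the model
    rationale: List[str]  # step-by-step reasoning elicited before the answer
    answer: str           # the final, short answer

    def to_target(self) -> str:
        """Render the rationale and answer as a single CoT-style training target."""
        steps = "\n".join(f"Step {i + 1}: {s}" for i, s in enumerate(self.rationale))
        return f"{steps}\nAnswer: {self.answer}"

example = CoTExample(
    image_path="chart_001.png",
    instruction="Which year shows the largest increase in revenue?",
    rationale=[
        "Read the revenue value for each year from the chart.",
        "Compute the year-over-year differences.",
        "The largest difference occurs between 2021 and 2022.",
    ],
    answer="2022",
)
print(example.to_target())
```

Compared with a bare phrase-level answer ("2022"), the rendered target exposes the intermediate steps the model is expected to learn.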
Why it matters?
This research is important because it enhances the capabilities of AI systems in understanding and processing information from multiple sources, like text and images. By providing a richer dataset for training, MAmmoTH-VL can lead to more intelligent AI applications that can tackle complex problems, making technology more effective in areas such as education, healthcare, and customer service.
Abstract
Open-source multimodal large language models (MLLMs) have shown significant potential in a broad range of multimodal tasks. However, their reasoning capabilities remain constrained by existing instruction-tuning datasets, which were predominantly repurposed from academic datasets such as VQA, AI2D, and ChartQA. These datasets target simplistic tasks and provide only phrase-level answers without any intermediate rationales. To address these challenges, we introduce a scalable and cost-effective method to construct a large-scale multimodal instruction-tuning dataset with rich intermediate rationales designed to elicit CoT reasoning. Using only open models, we create a dataset containing 12M instruction-response pairs to cover diverse, reasoning-intensive tasks with detailed and faithful rationales. Experiments demonstrate that training MLLMs on this dataset significantly improves reasoning capabilities, achieving state-of-the-art performance on benchmarks such as MathVerse (+8.1%), MMMU-Pro (+7%), and MuirBench (+13.3%). Additionally, the model demonstrates notable improvements of up to 4% on non-reasoning-based benchmarks. Ablation studies further highlight the importance of key components, such as rewriting and self-filtering, in the dataset construction process.
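The abstract credits rewriting and self-filtering as key steps in building the dataset. A minimal sketch of that idea follows, assuming a generic `generate(prompt)` callable backed by an open model; the prompts and helper names are hypothetical and do not come from the paper:

```python
def rewrite_with_rationale(generate, question: str, short_answer: str) -> str:
    """Ask an open model to expand a phrase-level answer into step-by-step reasoning."""
    prompt = (
        f"Question: {question}\n"
        f"Short answer: {short_answer}\n"
        "Rewrite the answer as numbered reasoning steps ending with the same final answer."
    )
    return generate(prompt)

def self_filter(generate, question: str, rewritten: str, short_answer: str) -> bool:
    """Keep a rewritten sample only if the model judges it consistent with the reference answer."""
    prompt = (
        f"Question: {question}\n"
        f"Candidate response:\n{rewritten}\n"
        f"Reference answer: {short_answer}\n"
        "Does the candidate's final answer match the reference? Reply YES or NO."
    )
    return generate(prompt).strip().upper().startswith("YES")

def build_pair(generate, question: str, short_answer: str):
    """Rewrite one (question, short answer) pair and discard it if it fails the self-check."""
    rewritten = rewrite_with_rationale(generate, question, short_answer)
    if self_filter(generate, question, rewritten, short_answer):
        return question, rewritten
    return None
```

The filtering step is what keeps the rewritten rationales faithful: a rewrite whose final answer drifts from the original label is dropped rather than added to the training set.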