Progressive Multimodal Reasoning via Active Retrieval
Guanting Dong, Chenghao Zhang, Mengjie Deng, Yutao Zhu, Zhicheng Dou, Ji-Rong Wen
2024-12-20

Summary
This paper presents a new framework called AR-MCTS, which helps multimodal large language models (MLLMs) reason through complex tasks that involve both text and images. It uses a method called Active Retrieval to gather useful supporting information at each step while the model works through a problem.
What's the problem?
Multimodal reasoning tasks, which require understanding and combining information from different sources such as text and images, remain very challenging for MLLMs. Existing methods struggle to verify the intermediate steps of the model's reasoning, which leads to less reliable outcomes.
What's the solution?
The authors developed AR-MCTS, which combines Active Retrieval with Monte Carlo Tree Search (MCTS). At each step of the reasoning process, the framework retrieves insights relevant to the current reasoning state, and the search automatically produces step-wise annotations that are used to train a process reward model for verifying the reasoning. Compared with traditional beam search sampling, this makes the explored reasoning space more diverse and reliable; a minimal sketch of the search loop follows.
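To make the idea concrete, here is a minimal, self-contained sketch of MCTS where retrieval happens actively at every expansion, in the spirit of AR-MCTS. The helpers `retrieve_insights`, `expand_step`, and `score_step` are hypothetical stand-ins for the paper's retrieval module, the MLLM step generator, and the process reward model; the paper does not specify these interfaces, so this is an illustration under assumed signatures, not the authors' implementation.

```python
import math
import random

# Hypothetical stand-ins (NOT the paper's actual components):
def retrieve_insights(problem, partial_steps, corpus, k=2):
    """Actively retrieve k supporting insights for the CURRENT reasoning
    state; faked here by sampling from a toy hybrid-modal corpus."""
    return random.sample(corpus, min(k, len(corpus)))

def expand_step(problem, partial_steps, insights):
    """Stand-in for the MLLM proposing one candidate next reasoning step,
    conditioned on the retrieved insights."""
    return f"step{len(partial_steps) + 1}[{insights[0]}]"

def score_chain(problem, steps):
    """Stand-in for a process reward model scoring a partial chain in [0, 1]."""
    return random.random()

class Node:
    def __init__(self, steps, parent=None):
        self.steps = steps          # reasoning prefix at this node
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value = 0.0

    def ucb(self, c=1.4):
        # Standard UCT score; unvisited nodes are explored first.
        if self.visits == 0:
            return float("inf")
        return self.value / self.visits + c * math.sqrt(
            math.log(self.parent.visits) / self.visits)

def ar_mcts(problem, corpus, iterations=50, max_depth=4):
    root = Node(steps=[])
    for _ in range(iterations):
        # 1. Selection: descend by UCB until reaching a leaf.
        node = root
        while node.children:
            node = max(node.children, key=Node.ucb)
        # 2. Expansion with ACTIVE retrieval: fetch insights for this
        #    specific reasoning state, then propose a step from them.
        if len(node.steps) < max_depth:
            insights = retrieve_insights(problem, node.steps, corpus)
            child = Node(node.steps + [expand_step(problem, node.steps, insights)],
                         parent=node)
            node.children.append(child)
            node = child
        # 3. Evaluation: score the partial chain with the stand-in PRM.
        reward = score_chain(problem, node.steps)
        # 4. Backpropagation: update statistics along the path.
        while node is not None:
            node.visits += 1
            node.value += reward
            node = node.parent
    # Prefer the most-visited first step as the next reasoning move.
    return max(root.children, key=lambda n: n.visits).steps

corpus = ["text snippet A", "caption B", "formula C", "diagram note D"]
print(ar_mcts("What does the figure show?", corpus))
```

The key difference from plain MCTS is in step 2: retrieval is re-run for each node's reasoning prefix rather than once per problem, so later steps can draw on different supporting evidence than earlier ones.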
Why it matters?
This research is significant because it enhances how AI models can understand and reason about complex information from multiple sources. By improving the reasoning capabilities of LLMs, AR-MCTS can lead to better performance in applications like question answering, image analysis, and other tasks that require deep understanding of both text and visual data.
Abstract
Multi-step multimodal reasoning tasks pose significant challenges for multimodal large language models (MLLMs), and finding effective ways to enhance their performance in such scenarios remains an unresolved issue. In this paper, we propose AR-MCTS, a universal framework designed to progressively improve the reasoning capabilities of MLLMs through Active Retrieval (AR) and Monte Carlo Tree Search (MCTS). Our approach begins with the development of a unified retrieval module that retrieves key supporting insights for solving complex reasoning problems from a hybrid-modal retrieval corpus. To bridge the gap in automated multimodal reasoning verification, we employ the MCTS algorithm combined with an active retrieval mechanism, which enables the automatic generation of step-wise annotations. This strategy dynamically retrieves key insights for each reasoning step, moving beyond traditional beam search sampling to improve the diversity and reliability of the reasoning space. Additionally, we introduce a process reward model that aligns progressively to support the automatic verification of multimodal reasoning tasks. Experimental results across three complex multimodal reasoning benchmarks confirm the effectiveness of the AR-MCTS framework in enhancing the performance of various multimodal models. Further analysis demonstrates that AR-MCTS can optimize sampling diversity and accuracy, yielding reliable multimodal reasoning.
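The abstract mentions a process reward model (PRM) that verifies multimodal reasoning. As a rough illustration of how such a model is typically used at inference time, the sketch below scores several candidate reasoning chains step by step and selects the best one. The scorer `prm_score` and the min-aggregation rule are assumptions for illustration; the paper trains its PRM on MCTS-generated step-wise annotations, and its exact selection rule is not reproduced here.

```python
from typing import Callable, List

def select_by_prm(
    candidates: List[List[str]],                    # each candidate is a list of steps
    prm_score: Callable[[List[str], int], float],   # hypothetical per-step scorer
) -> List[str]:
    """Pick the chain whose weakest step scores highest (min-aggregation,
    one common way to aggregate process rewards over a chain)."""
    def chain_score(steps: List[str]) -> float:
        return min(prm_score(steps, i) for i in range(len(steps)))
    return max(candidates, key=chain_score)

# Toy usage with a dummy scorer that prefers shorter steps.
dummy = lambda steps, i: 1.0 / (1.0 + len(steps[i]))
chains = [["a long verbose step", "ok"], ["short", "tidy"]]
print(select_by_prm(chains, dummy))
```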