Visual Question Decomposition on Multimodal Large Language Models

Haowei Zhang, Jianzhe Liu, Zhen Han, Shuo Chen, Bailan He, Volker Tresp, Zhiqiang Xu, Jindong Gu

2024-10-01

Summary

This paper explores how to improve the ability of Multimodal Large Language Models (MLLMs) to break down complex questions into simpler parts, making it easier for them to provide accurate answers.

What's the problem?

Multimodal Large Language Models are powerful tools that can understand and generate text and images, but they often struggle with complex questions that need to be broken down into smaller, more manageable sub-questions before they can be answered well. This limits their ability to give detailed and accurate responses.

What's the solution?

To address this issue, the authors first built a systematic evaluation framework, including a dataset and criteria for judging the quality of decomposed sub-questions, which revealed that existing MLLMs produce poor sub-questions. They then introduced a finetuning dataset, DecoVQA+, and an efficient finetuning pipeline that teaches MLLMs to generate high-quality sub-questions and to decide when decomposition is needed at all. After training with this pipeline, the models produced substantially better sub-questions and performed better on visual question answering tasks.
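The decomposition step can be pictured as a prompt-and-parse loop: ask the model whether and how to split the question, then collect the numbered sub-questions it emits. A minimal sketch, assuming a hypothetical prompt template and output format (the paper's actual wording may differ):

```python
import re

def build_decomposition_prompt(question: str) -> str:
    # Hypothetical prompt template, not the paper's actual one.
    return (
        "Given the image, decide whether the question below needs to be "
        "broken into simpler sub-questions. If so, list them, one per line, "
        "numbered 1., 2., ...\n"
        f"Question: {question}"
    )

def parse_sub_questions(model_output: str) -> list[str]:
    # Extract lines that look like "1. ...", "2. ..." from the model's reply.
    return [m.group(1).strip()
            for m in re.finditer(r"^\s*\d+\.\s*(.+)$", model_output, re.MULTILINE)]

# Example reply an MLLM might produce for "Is the red object edible?":
reply = "1. Which object in the image is red?\n2. Is that object a food item?"
print(parse_sub_questions(reply))
```

Each parsed sub-question would then be answered (again with the image in context) before composing the final answer to the original question.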

Why it matters?

This research is important because it enhances the capabilities of MLLMs, allowing them to handle more complex queries effectively. By improving how these models can break down questions, they can provide better assistance in various applications, such as education, customer service, and content creation.

Abstract

Question decomposition has emerged as an effective strategy for prompting Large Language Models (LLMs) to answer complex questions. However, while existing methods primarily focus on unimodal language models, the question decomposition capability of Multimodal Large Language Models (MLLMs) has yet to be explored. To this end, this paper explores visual question decomposition on MLLMs. Specifically, we introduce a systematic evaluation framework including a dataset and several evaluation criteria to assess the quality of the decomposed sub-questions, revealing that existing MLLMs struggle to produce high-quality sub-questions. To address this limitation, we propose a specific finetuning dataset, DecoVQA+, for enhancing the model's question decomposition capability. Aiming at enabling models to perform appropriate selective decomposition, we propose an efficient finetuning pipeline. The finetuning pipeline consists of our proposed dataset and a training objective for selective decomposition. Finetuned MLLMs demonstrate significant improvements in the quality of sub-questions and the policy of selective question decomposition. Additionally, the models also achieve higher accuracy with selective decomposition on VQA benchmark datasets.
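The abstract's "training objective for selective decomposition" can be read as adding an auxiliary penalty on the decompose-or-answer-directly decision alongside the usual answer loss. A toy sketch under that assumption (the weighting and loss form are illustrative, not the paper's actual objective):

```python
import math

def binary_ce(p: float, y: int) -> float:
    # Standard binary cross-entropy for the decompose-or-not decision.
    eps = 1e-9
    return -(y * math.log(p + eps) + (1 - y) * math.log(1 - p + eps))

def selective_decomposition_loss(answer_loss: float,
                                 p_decompose: float,
                                 should_decompose: int,
                                 weight: float = 0.5) -> float:
    # Hypothetical combination: language-modeling loss on the answer plus
    # a weighted penalty for choosing the wrong decomposition policy.
    return answer_loss + weight * binary_ce(p_decompose, should_decompose)

# Model is fairly confident (p=0.9) it should decompose, and it should (y=1):
print(round(selective_decomposition_loss(1.2, 0.9, 1), 4))
```

The intuition matches the abstract: the model is rewarded not just for answering, but for decomposing only when decomposition actually helps.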