
MMDU: A Multi-Turn Multi-Image Dialog Understanding Benchmark and Instruction-Tuning Dataset for LVLMs

Ziyu Liu, Tao Chu, Yuhang Zang, Xilin Wei, Xiaoyi Dong, Pan Zhang, Zijian Liang, Yuanjun Xiong, Yu Qiao, Dahua Lin, Jiaqi Wang

2024-06-18

Summary

This paper introduces MMDU, a new benchmark and dataset designed to improve how Large Vision-Language Models (LVLMs) understand and respond to conversations that span multiple turns and involve multiple images. It aims to help these models perform better in real-world scenarios where conversations are longer and more complex.

What's the problem?

Current LVLMs perform well in simple situations, such as answering questions about a single image or a short piece of text. However, they struggle with longer conversations that include many images and require tracking context over multiple turns. Existing benchmarks mostly test these models with single-choice questions or short answers, which does not reflect the challenges they would face in real-life interactions.

What's the solution?

To address these issues, the authors created MMDU, a benchmark for evaluating how well LVLMs handle multi-turn conversations with multiple images, along with MMDU-45k, a large-scale instruction-tuning dataset for improving that ability. They used a clustering algorithm to gather related images and text from Wikipedia, then built question-answer pairs with human annotators assisted by the GPT-4o model (a rough sketch of this idea is shown below). MMDU allows conversations of up to 27 turns and 20 images, with up to 18k image+text tokens, which is significantly more complex than previous benchmarks.
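The summary above describes the pipeline only at a high level, but the data-gathering idea can be sketched roughly: embed Wikipedia descriptions, cluster them so that related images end up together, and let each cluster seed one multi-image dialogue. The code below is a minimal illustration under those assumptions; the `entries` format, the TF-IDF embedding, and k-means are stand-ins chosen for the sketch, not the authors' actual implementation choices.

```python
# Minimal sketch: cluster Wikipedia entries so related images and descriptions
# land in the same group, which can then seed one multi-image dialogue.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

# Hypothetical input: each entry pairs an image with its Wikipedia description.
entries = [
    {"image": "golden_gate.jpg", "text": "The Golden Gate Bridge spans the Golden Gate strait in San Francisco."},
    {"image": "bay_bridge.jpg", "text": "The San Francisco-Oakland Bay Bridge carries traffic across San Francisco Bay."},
    {"image": "eiffel_tower.jpg", "text": "The Eiffel Tower is a wrought-iron lattice tower in Paris, France."},
    # ... in practice, many thousands of entries
]

# Embed the descriptions and cluster them so topically related entries group together.
texts = [e["text"] for e in entries]
vectors = TfidfVectorizer(stop_words="english").fit_transform(texts)
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(vectors)

# Group entries by cluster; each cluster becomes the image/text pool for one
# multi-turn, multi-image conversation.
clusters = {}
for entry, label in zip(entries, labels):
    clusters.setdefault(int(label), []).append(entry)

for label, group in clusters.items():
    print(f"cluster {label}: {[e['image'] for e in group]}")
```

In the actual pipeline described by the paper, the question-answer pairs for each group of images are then drafted with the assistance of GPT-4o and reviewed by human annotators; that step is not shown here.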

Why it matters?

This research is important because it provides a way to better assess and improve the capabilities of AI models in understanding complex dialogues that involve multiple images. By fine-tuning open-source LVLMs on this new dataset, researchers can help bridge the gap between current AI technology and the demands of real-world applications, making AI interactions more effective and meaningful.

Abstract

Generating natural and meaningful responses to communicate with multi-modal human inputs is a fundamental capability of Large Vision-Language Models (LVLMs). While current open-source LVLMs demonstrate promising performance in simplified scenarios such as single-turn single-image input, they fall short in real-world conversation scenarios such as following instructions in a long context history with multi-turn and multi-image inputs. Existing LVLM benchmarks primarily focus on single-choice questions or short-form responses, which do not adequately assess the capabilities of LVLMs in real-world human-AI interaction applications. Therefore, we introduce MMDU, a comprehensive benchmark, and MMDU-45k, a large-scale instruction tuning dataset, designed to evaluate and improve LVLMs' abilities in multi-turn and multi-image conversations. We employ a clustering algorithm to find the relevant images and textual descriptions from open-source Wikipedia and construct the question-answer pairs by human annotators with the assistance of the GPT-4o model. MMDU has a maximum of 18k image+text tokens, 20 images, and 27 turns, which is at least 5x longer than previous benchmarks and poses challenges to current LVLMs. Our in-depth analysis of 15 representative LVLMs using MMDU reveals that open-source LVLMs lag behind closed-source counterparts due to limited conversational instruction tuning data. We demonstrate that fine-tuning open-source LVLMs on MMDU-45k significantly addresses this gap, generating longer and more accurate conversations, and improving scores on MMDU and existing benchmarks (MMStar: +1.1%, MathVista: +1.5%, ChartQA: +1.2%). Our contributions pave the way for bridging the gap between current LVLM models and real-world application demands. This project is available at https://github.com/Liuziyu77/MMDU.
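For a concrete sense of what a multi-turn, multi-image instruction-tuning sample could look like, here is a rough illustration. The field names, placeholder tokens, and conversation content below are assumptions made for this example only; the actual MMDU-45k schema is defined in the linked repository.

```python
# Rough illustration of a multi-turn, multi-image training record.
# Field names and content are hypothetical, not the official MMDU-45k schema.
sample = {
    "images": ["images/notre_dame_1.jpg", "images/notre_dame_2.jpg"],
    "conversation": [
        {"role": "user",
         "content": "<image> <image> What do these two photos have in common?"},
        {"role": "assistant",
         "content": "Both photos show Notre-Dame de Paris, photographed from different angles ..."},
        {"role": "user",
         "content": "Which one was taken before the 2019 fire, and how can you tell?"},
        {"role": "assistant",
         "content": "The first photo still shows the wooden spire, so it predates the fire ..."},
        # ... further turns, potentially referencing many more images
    ],
}

# Fine-tuning would then follow the usual pattern: render each conversation with
# the model's chat template, insert image features at the <image> placeholders,
# and supervise only the assistant turns.
```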