Taking Notes Brings Focus? Towards Multi-Turn Multimodal Dialogue Learning
Jiazheng Liu, Sipeng Zheng, Börje F. Karlsson, Zongqing Lu
2025-03-11
Summary
This paper is about teaching AI to handle realistic conversations that mix images and text. It introduces a dedicated training dataset (MMDiag) and a new model (DiagNote) that thinks step by step and focuses on specific parts of images during a chat.
What's the problem?
Current AI models are mostly trained on single-turn question-answering about images, which makes them struggle with real conversations where people ask follow-up questions and shift focus between different parts of a picture.
What's the solution?
The researchers built a multi-turn dialogue dataset (MMDiag) using carefully designed rules combined with GPT assistance, then created DiagNote, a model whose two modules work together to reason through answers and track visual details across conversation turns.
Why does it matter?
This helps AI assistants handle complex image-based conversations better, such as explaining medical scans step by step or walking through diagrams in online tutoring.
Abstract
Multimodal large language models (MLLMs), built on large-scale pre-trained vision towers and language models, have shown great capabilities in multimodal understanding. However, most existing MLLMs are trained on single-turn vision question-answering tasks, which do not accurately reflect real-world human conversations. In this paper, we introduce MMDiag, a multi-turn multimodal dialogue dataset. This dataset is collaboratively generated through deliberately designed rules and GPT assistance, featuring strong correlations between questions, between questions and images, and among different image regions; thus aligning more closely with real-world scenarios. MMDiag serves as a strong benchmark for multi-turn multimodal dialogue learning and brings more challenges to the grounding and reasoning capabilities of MLLMs. Further, inspired by human vision processing, we present DiagNote, an MLLM equipped with multimodal grounding and reasoning capabilities. DiagNote consists of two modules (Deliberate and Gaze) interacting with each other to perform Chain-of-Thought and annotations respectively, throughout multi-turn dialogues. We empirically demonstrate the advantages of DiagNote in both grounding and jointly processing and reasoning with vision and language information over existing MLLMs.
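The abstract gives no implementation details, but conceptually the two modules alternate each turn: Gaze grounds the current question to image regions (annotations), and Deliberate performs Chain-of-Thought reasoning over those regions plus the dialogue history. Below is a minimal Python sketch of that interaction loop; all function names, signatures, and stub outputs are illustrative assumptions, not the paper's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    """One dialogue turn: question, grounded regions, rationale, and answer."""
    question: str
    regions: list = field(default_factory=list)  # [(x1, y1, x2, y2), ...]
    rationale: str = ""
    answer: str = ""


def gaze(image, question, history):
    """Hypothetical Gaze module: propose bounding-box annotations for the image
    regions the current question (given the dialogue history) refers to.
    A real implementation would use a vision tower; this stub returns one box."""
    return [(0.2, 0.3, 0.6, 0.8)]


def deliberate(question, regions, history):
    """Hypothetical Deliberate module: produce a Chain-of-Thought rationale and
    an answer conditioned on the grounded regions and all prior turns."""
    rationale = (f"Focusing on {len(regions)} region(s); "
                 f"recalling {len(history)} prior turn(s).")
    return rationale, "stub answer"


def run_dialogue(image, questions):
    """Multi-turn loop: per turn, Gaze grounds the question to image regions,
    then Deliberate reasons over those regions and the accumulated history."""
    history: list[Turn] = []
    for q in questions:
        regions = gaze(image, q, history)                    # grounding step
        rationale, answer = deliberate(q, regions, history)  # CoT step
        history.append(Turn(q, regions, rationale, answer))
    return history


if __name__ == "__main__":
    turns = run_dialogue(image=None, questions=[
        "What is the person on the left holding?",
        "Is it the same object as the one in the background?",
    ])
    for t in turns:
        print(t.question, "->", t.answer)
```

The point of the sketch is the data flow: because both modules see the full `history`, follow-up questions that refer back to earlier turns or earlier image regions stay grounded, which is exactly the multi-turn correlation MMDiag is built to test.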