
SymDPO: Boosting In-Context Learning of Large Multimodal Models with Symbol Demonstration Direct Preference Optimization

Hongrui Jia, Chaoya Jiang, Haiyang Xu, Wei Ye, Mengfan Dong, Ming Yan, Ji Zhang, Fei Huang, Shikun Zhang

2024-11-20


Summary

This paper introduces SymDPO, a method that improves how large multimodal models learn from in-context examples mixing text and images by replacing the text answers in those examples with symbols.

What's the problem?

Large multimodal models (LMMs) can process both visual and textual information, but they often fail to use the visual context in their in-context examples when answering questions. Instead of understanding the images, they tend to follow textual patterns, which limits their ability to give accurate answers grounded in the visual content.

What's the solution?

To solve this problem, the authors propose Symbol Demonstration Direct Preference Optimization (SymDPO). The method replaces the text answers in in-context examples with random symbols that have no linguistic connection to the real answers, so the model cannot rely on text patterns alone and must study the example images to work out what each symbol stands for. The model is then trained with direct preference optimization to prefer responses grounded in this visual-symbol mapping, which improves its ability to answer questions based on what it actually sees (a rough sketch of how such examples might be built follows below).
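
To make this concrete, here is a minimal, hypothetical sketch of how symbol demonstrations and a DPO-style preference pair could be constructed. The data format, the function names, and the particular choice of chosen/rejected responses are illustrative assumptions for this summary, not the authors' released implementation.

```python
import random
import string


def make_symbol(length: int = 4) -> str:
    """Generate a random symbol string with no linguistic relation to any real answer."""
    return "".join(random.choices(string.ascii_uppercase, k=length))


def symbolize_demonstrations(demos):
    """Replace each demonstration's text answer with a random symbol.

    `demos` is a list of dicts like {"image": ..., "question": ..., "answer": ...}
    (a hypothetical format). Identical answers map to the same symbol, so the only
    way to answer the final query correctly is to ground each symbol in the images.
    """
    answer_to_symbol = {}
    symbolized = []
    for d in demos:
        sym = answer_to_symbol.setdefault(d["answer"], make_symbol())
        symbolized.append({**d, "answer": sym})
    return symbolized, answer_to_symbol


def build_preference_pair(demos, query, query_answer):
    """Build one DPO-style preference pair (an illustrative assumption of the setup):

    chosen   = the symbol tied to the correct answer in the demonstrations
               (requires grounding the symbols in the demo images)
    rejected = the plain text answer, i.e. the purely textual shortcut
    """
    symbolized, answer_to_symbol = symbolize_demonstrations(demos)
    # Assumes the correct answer appears among the demonstration answers;
    # otherwise fall back to a fresh symbol for illustration purposes only.
    chosen = answer_to_symbol.get(query_answer, make_symbol())
    prompt = symbolized + [{"image": query["image"], "question": query["question"]}]
    return {"prompt": prompt, "chosen": chosen, "rejected": query_answer}


if __name__ == "__main__":
    demos = [
        {"image": "img_cat.jpg", "question": "What animal is this?", "answer": "cat"},
        {"image": "img_dog.jpg", "question": "What animal is this?", "answer": "dog"},
    ]
    query = {"image": "img_cat2.jpg", "question": "What animal is this?"}
    print(build_preference_pair(demos, query, query_answer="cat"))
```

In this sketch the preferred ("chosen") response can only be produced by linking the query image back to the demonstration image that carries the same symbol, which mirrors the paper's goal of forcing the model to use the visual context rather than textual patterns.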

Why it matters?

This research is important because it enhances how AI models understand and integrate different types of information. By improving the way LMMs learn from both images and text, SymDPO can lead to better performance in tasks that require a deep understanding of visual contexts, making these models more effective for real-world applications like medical diagnostics, autonomous driving, and more.

Abstract

As language models continue to scale, Large Language Models (LLMs) have exhibited emerging capabilities in In-Context Learning (ICL), enabling them to solve language tasks by prefixing a few in-context demonstrations (ICDs) as context. Inspired by these advancements, researchers have extended these techniques to develop Large Multimodal Models (LMMs) with ICL capabilities. However, existing LMMs face a critical issue: they often fail to effectively leverage the visual context in multimodal demonstrations and instead simply follow textual patterns. This indicates that LMMs do not achieve effective alignment between multimodal demonstrations and model outputs. To address this problem, we propose Symbol Demonstration Direct Preference Optimization (SymDPO). Specifically, SymDPO aims to break the traditional paradigm of constructing multimodal demonstrations by using random symbols to replace text answers within instances. This forces the model to carefully understand the demonstration images and establish a relationship between the images and the symbols to answer questions correctly. We validate the effectiveness of this method on multiple benchmarks, demonstrating that with SymDPO, LMMs can more effectively understand the multimodal context within examples and utilize this knowledge to answer questions better.