
SEA: Supervised Embedding Alignment for Token-Level Visual-Textual Integration in MLLMs

Yuanyang Yin, Yaqi Zhao, Yajie Zhang, Ke Lin, Jiahao Wang, Xin Tao, Pengfei Wan, Di Zhang, Baoqun Yin, Wentao Zhang

2024-08-23

Summary

This paper introduces Supervised Embedding Alignment (SEA), a method that improves how Multimodal Large Language Models (MLLMs) combine visual and textual information.

What's the problem?

MLLMs, which are designed to understand both images and text, typically connect a vision encoder to a language model through an adapter. When this adapter is trained with only image-level supervision, the visual tokens it produces end up poorly aligned with the language model's embedding space. This misalignment undermines the language model's capabilities and makes it harder for the combined model to perform well on tasks that require both visual and language understanding.

What's the solution?

To solve this problem, the authors introduce SEA, a method that aligns visual tokens (the representations of individual image patches) with the language model's embedding space using contrastive learning. SEA leverages pre-trained vision-language models such as CLIP to supervise this alignment at the token level, so that visual and textual representations integrate more coherently, improving both the performance and the interpretability of the models' outputs.
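
The paper is not reproduced in code here, but the general idea of token-level contrastive alignment can be illustrated with a minimal PyTorch sketch. Everything below is an assumption for illustration: the function name, tensor names, and the InfoNCE-style loss are hypothetical stand-ins rather than the authors' implementation, and SEA's actual objective and its way of deriving per-token text targets from CLIP may differ.

```python
# Minimal, illustrative sketch of token-level contrastive alignment.
# Assumptions (not from the paper): `visual_tokens` are adapter outputs
# projected into the LLM embedding space, and `text_embeds` are the LLM
# embeddings of the words/labels that a CLIP-based matching step assigned
# to each visual token. Shapes and names are hypothetical.
import torch
import torch.nn.functional as F

def token_alignment_loss(visual_tokens: torch.Tensor,
                         text_embeds: torch.Tensor,
                         temperature: float = 0.07) -> torch.Tensor:
    """InfoNCE-style loss pulling each visual token toward its matched text embedding.

    visual_tokens: (N, D) adapter outputs for N image patches
    text_embeds:   (N, D) LLM embeddings matched to those patches
    """
    v = F.normalize(visual_tokens, dim=-1)
    t = F.normalize(text_embeds, dim=-1)
    logits = v @ t.T / temperature                      # (N, N) cosine similarities
    targets = torch.arange(v.size(0), device=v.device)  # positives lie on the diagonal
    # Symmetric contrastive loss: token i should match text i and vice versa.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.T, targets))
```

In a training pipeline, a loss of this kind would presumably be added to the usual adapter pre-training objective, so the adapter learns to emit visual tokens that already sit close to the language model's word embeddings.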

Why it matters?

This research is significant because it enhances the capabilities of MLLMs, particularly smaller models, without requiring extra training data or additional inference computation. By improving how these models integrate visual and textual information, SEA can lead to more reliable AI systems that better understand and respond to complex queries involving both images and text.

Abstract

Multimodal Large Language Models (MLLMs) have recently demonstrated remarkable perceptual and reasoning abilities, typically comprising a Vision Encoder, an Adapter, and a Large Language Model (LLM). The adapter serves as the critical bridge between the visual and language components. However, training adapters with image-level supervision often results in significant misalignment, undermining the LLMs' capabilities and limiting the potential of Multimodal LLMs. To address this, we introduce Supervised Embedding Alignment (SEA), a token-level alignment method that leverages vision-language pre-trained models, such as CLIP, to align visual tokens with the LLM's embedding space through contrastive learning. This approach ensures a more coherent integration of visual and language representations, enhancing the performance and interpretability of multimodal LLMs while preserving their inherent capabilities. Extensive experiments show that SEA effectively improves MLLMs, particularly for smaller models, without adding extra data or inference computation. SEA also lays the groundwork for developing more general and adaptable solutions to enhance multimodal systems.
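
For reference, a token-level contrastive alignment objective of this kind is commonly written as an InfoNCE-style loss over matched visual and text embeddings; the exact formulation used by SEA may differ from this representative form:

```latex
\mathcal{L}_{\text{align}}
  = -\frac{1}{N}\sum_{i=1}^{N}
    \log\frac{\exp\!\left(\operatorname{sim}(v_i, t_i)/\tau\right)}
             {\sum_{j=1}^{N}\exp\!\left(\operatorname{sim}(v_i, t_j)/\tau\right)}
```

Here \(v_i\) denotes the \(i\)-th visual token after the adapter, \(t_i\) the text embedding matched to it in the LLM's embedding space (e.g., via CLIP), \(\operatorname{sim}\) cosine similarity, and \(\tau\) a temperature hyperparameter.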