mPLUG-Owl3: Towards Long Image-Sequence Understanding in Multi-Modal Large Language Models
Jiabo Ye, Haiyang Xu, Haowei Liu, Anwen Hu, Ming Yan, Qi Qian, Ji Zhang, Fei Huang, Jingren Zhou
2024-08-12

Summary
This paper presents mPLUG-Owl3, a new model designed to improve how machines understand and process long sequences of images, videos, and text together.
What's the problem?
While existing multi-modal large language models (MLLMs) are good at handling single images or short video clips, they struggle to understand longer sequences of images and videos. This limitation makes it difficult for these models to perform well on tasks that require analyzing multiple images or lengthy videos, which are common in real-world applications.
What's the solution?
The authors developed mPLUG-Owl3, which introduces hyper attention blocks that integrate visual and textual information into a shared, language-guided semantic space (a simplified illustration follows below). This allows the model to attend to important details across long sequences of images and videos efficiently. They also created a new evaluation called Distractor Resistance to test how well the model maintains focus when distracting content is mixed into the input. Their experiments show that mPLUG-Owl3 outperforms comparable models at understanding single images, multiple images, and videos.
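To make the idea concrete, here is a minimal sketch of a cross-attention fusion block in the spirit of the hyper attention design: text hidden states attend to visual features, and the result is gated back into the language stream. This is not the authors' implementation; the layer names, dimensions, and zero-initialized gating scheme are assumptions chosen for illustration.

```python
# Illustrative sketch only: a gated cross-attention block that mixes visual
# features into the text stream. Names, dimensions, and the gating scheme are
# assumptions, not the paper's actual "hyper attention" implementation.
import torch
import torch.nn as nn

class CrossAttentionFusionBlock(nn.Module):
    def __init__(self, hidden_dim: int = 1024, num_heads: int = 8):
        super().__init__()
        # Text tokens act as queries; visual tokens act as keys/values.
        self.cross_attn = nn.MultiheadAttention(
            embed_dim=hidden_dim, num_heads=num_heads, batch_first=True
        )
        self.norm = nn.LayerNorm(hidden_dim)
        # A learnable gate initialized to zero lets the block start as an
        # identity mapping and gradually mix in visual information.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, text_states: torch.Tensor, visual_states: torch.Tensor) -> torch.Tensor:
        # text_states:   (batch, num_text_tokens,   hidden_dim)
        # visual_states: (batch, num_visual_tokens, hidden_dim)
        attended, _ = self.cross_attn(
            query=self.norm(text_states), key=visual_states, value=visual_states
        )
        # Gated residual connection keeps the language backbone intact.
        return text_states + torch.tanh(self.gate) * attended


# Usage: fuse features from a long image sequence into the text stream.
text = torch.randn(2, 32, 1024)          # 32 text tokens
images = torch.randn(2, 4 * 256, 1024)   # 4 images x 256 visual tokens each
block = CrossAttentionFusionBlock()
fused = block(text, images)              # shape: (2, 32, 1024)
```

The zero-initialized gate is a common trick for injecting a new modality into a pretrained language model without disturbing its original behavior at the start of training.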
Why it matters?
This research is significant because it enhances the ability of AI systems to interpret complex visual information over extended periods. By improving how machines understand long image sequences, mPLUG-Owl3 can contribute to advancements in fields like video analysis, automated content creation, and interactive AI applications, making technology more effective and user-friendly.
Abstract
Multi-modal Large Language Models (MLLMs) have demonstrated remarkable capabilities in executing instructions for a variety of single-image tasks. Despite this progress, significant challenges remain in modeling long image sequences. In this work, we introduce the versatile multi-modal large language model, mPLUG-Owl3, which enhances the capability for long image-sequence understanding in scenarios that incorporate retrieved image-text knowledge, interleaved image-text, and lengthy videos. Specifically, we propose novel hyper attention blocks to efficiently integrate vision and language into a common language-guided semantic space, thereby facilitating the processing of extended multi-image scenarios. Extensive experimental results suggest that mPLUG-Owl3 achieves state-of-the-art performance among models with a similar size on single-image, multi-image, and video benchmarks. Moreover, we propose a challenging long visual sequence evaluation named Distractor Resistance to assess the ability of models to maintain focus amidst distractions. Finally, with the proposed architecture, mPLUG-Owl3 demonstrates outstanding performance on ultra-long visual sequence inputs. We hope that mPLUG-Owl3 can contribute to the development of more efficient and powerful multimodal large language models.