LongLLaVA: Scaling Multi-modal LLMs to 1000 Images Efficiently via Hybrid Architecture
Xidong Wang, Dingjie Song, Shunian Chen, Chen Zhang, Benyou Wang
2024-09-05

Summary
This paper introduces LongLLaVA, a new multi-modal large language model (MLLM) that can understand and process nearly 1000 images efficiently.
What's the problem?
MLLMs increasingly need to handle long contexts, especially for video understanding and high-resolution images. However, existing models often lose performance when processing many images at once, and their memory and compute costs grow quickly, making them hard to use effectively.
What's the solution?
LongLLaVA addresses these issues by combining two model architectures (Mamba and Transformer blocks) in a single hybrid backbone and by optimizing its data construction and training strategy. The training data is built to capture both spatial dependencies (how images or image patches are arranged) and temporal dependencies (how images change over time) across multiple images. The result is a model that can efficiently process nearly 1000 images on a single A100 80GB GPU while maintaining high performance across various tasks.
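To make the hybrid idea concrete, below is a minimal, self-contained PyTorch sketch of an interleaved layer stack in the spirit of LongLLaVA. It is not the paper's implementation: SimpleSSMBlock is only a stand-in for a real Mamba block, and the 7:1 Mamba-to-attention ratio, hidden size, and 144-tokens-per-image figure are illustrative assumptions.

```python
# Sketch of a hybrid Mamba/Transformer stack: a few linear-cost, SSM-style
# token mixers interleaved with occasional full-attention layers. The layer
# ratio and dimensions here are assumptions, not LongLLaVA's actual config.
import torch
import torch.nn as nn


class SimpleSSMBlock(nn.Module):
    """Placeholder for a Mamba block: a gated, depthwise-convolutional token
    mixer whose cost grows linearly (not quadratically) with sequence length."""

    def __init__(self, d_model: int, d_conv: int = 4):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.in_proj = nn.Linear(d_model, 2 * d_model)
        # Depthwise causal conv mixes information along the sequence axis.
        self.conv = nn.Conv1d(d_model, d_model, d_conv,
                              padding=d_conv - 1, groups=d_model)
        self.out_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (B, T, D)
        residual = x
        x, gate = self.in_proj(self.norm(x)).chunk(2, dim=-1)
        # Conv over time, trimmed back to length T to stay causal.
        x = self.conv(x.transpose(1, 2))[..., : x.shape[1]].transpose(1, 2)
        x = nn.functional.silu(x) * gate
        return residual + self.out_proj(x)


class HybridStack(nn.Module):
    """Interleave SSM-style blocks with attention blocks (assumed 7:1 ratio).
    The attention layer here is bidirectional and unmasked for brevity."""

    def __init__(self, d_model: int = 1024, n_layers: int = 8,
                 attn_every: int = 8, n_heads: int = 16):
        super().__init__()
        layers = []
        for i in range(n_layers):
            if (i + 1) % attn_every == 0:
                layers.append(nn.TransformerEncoderLayer(
                    d_model, n_heads, dim_feedforward=4 * d_model,
                    batch_first=True, norm_first=True))
            else:
                layers.append(SimpleSSMBlock(d_model))
        self.layers = nn.ModuleList(layers)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


if __name__ == "__main__":
    # Toy run: visual tokens for 2 "images" plus a short text prompt.
    tokens = torch.randn(1, 2 * 144 + 32, 1024)  # 144 tokens/image is assumed
    out = HybridStack()(tokens)
    print(out.shape)  # torch.Size([1, 320, 1024])
```

The design intuition this sketch illustrates is that most layers scale linearly with sequence length, while a small number of attention layers retain global token-to-token interaction, which is what makes very long multi-image sequences tractable.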
Why it matters?
This research matters because it improves AI's ability to understand complex visual information. By making MLLMs more efficient, LongLLaVA can be applied to a wide range of tasks, including video analysis, high-resolution image understanding, and multi-modal agents, making advanced AI tools more accessible and effective.
Abstract
Expanding the long-context capabilities of Multi-modal Large Language Models (MLLMs) is crucial for video understanding, high-resolution image understanding, and multi-modal agents. This involves a series of systematic optimizations, including model architecture, data construction and training strategy, particularly addressing challenges such as degraded performance with more images and high computational costs. In this paper, we adapt the model architecture to a hybrid of Mamba and Transformer blocks, approach data construction with both temporal and spatial dependencies among multiple images, and employ a progressive training strategy. The released model LongLLaVA (Long-Context Large Language and Vision Assistant) is the first hybrid MLLM, and it achieves a better balance between efficiency and effectiveness. LongLLaVA not only achieves competitive results across various benchmarks, but also maintains high throughput and low memory consumption. In particular, it can process nearly a thousand images on a single A100 80GB GPU, showing promising application prospects for a wide range of tasks.
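As a rough illustration of why near-1000-image inputs are demanding, the back-of-envelope calculation below estimates the resulting sequence length. Both constants are assumptions for illustration only; the real per-image token count depends on the vision encoder resolution and any token compression or pooling the model applies.

```python
# Back-of-envelope estimate of the sequence length for a ~1000-image input.
# The constants below are illustrative assumptions, not numbers from the abstract.
TOKENS_PER_IMAGE = 144   # assumed visual tokens per image after pooling
TEXT_TOKENS = 1_000      # assumed budget for the text prompt and response
N_IMAGES = 1_000

total = N_IMAGES * TOKENS_PER_IMAGE + TEXT_TOKENS
print(f"~{total:,} tokens for {N_IMAGES} images")  # ~145,000 tokens
```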