
Multimodal Task Vectors Enable Many-Shot Multimodal In-Context Learning

Brandon Huang, Chancharik Mitra, Assaf Arbelle, Leonid Karlinsky, Trevor Darrell, Roei Herzig

2024-06-27


Summary

This paper introduces Multimodal Task Vectors (MTV), a method that lets large multimodal models (LMMs) learn from many in-context examples at once without fine-tuning the model, allowing them to handle tasks involving both text and images more efficiently.

What's the problem?

Large multimodal models are good at learning from a few examples, but they face a significant limitation: the amount of information they can process at one time is capped by the context length fixed during pretraining. This is especially challenging for tasks that involve both text and images, because each image is encoded as many tokens, so demonstrations fill up the context window quickly. As a result, the models cannot fit, and therefore cannot learn from, many examples at once.

What's the solution?

To solve this problem, the authors introduce Multimodal Task Vectors (MTV): compact implicit representations of in-context examples stored in the model's attention heads. Rather than keeping many demonstrations as tokens in the prompt, the model compresses them into these activations, so it can draw on far more examples than its context length would normally allow. The authors first show that such task vectors exist in LMMs, then use the extracted MTV to improve performance on various vision-and-language tasks, and show that the approach generalizes to similar out-of-domain tasks without requiring any additional context at inference.
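
To make the idea more concrete, below is a minimal, self-contained PyTorch sketch of the general task-vector recipe: average the per-head attention activations produced while reading many example shots, then inject that compact summary back into the heads when only the query is in context. The toy ToyBlock module, the random tensors standing in for encoded text-and-image shots, and the choice to patch every head at the last token position are illustrative assumptions, not the authors' implementation; the paper's actual procedure, including how it selects which attention heads carry the MTV, is more involved.

```python
# Minimal sketch of the task-vector idea described above, NOT the authors' code.
# Assumptions (not from the paper): a toy PyTorch attention block, a simple mean
# over per-head activations at the last token, and patching every head of one
# layer. The paper's head-selection step is omitted here.
import torch
import torch.nn as nn

torch.manual_seed(0)
D_MODEL, N_HEADS, SEQ_LEN = 32, 4, 8
HEAD_DIM = D_MODEL // N_HEADS


class ToyBlock(nn.Module):
    """One attention block whose per-head outputs we can read and overwrite."""

    def __init__(self):
        super().__init__()
        self.attn = nn.MultiheadAttention(D_MODEL, N_HEADS, batch_first=True)
        self.head_patch = None  # (n_heads, head_dim) values to inject, if any

    def forward(self, x):
        out, _ = self.attn(x, x, x)                      # (batch, seq, d_model)
        heads = out.view(*out.shape[:2], N_HEADS, HEAD_DIM)
        if self.head_patch is not None:
            # Overwrite the per-head activations at the last token position
            # with the stored "task vector" values.
            heads = heads.clone()
            heads[:, -1, :, :] = self.head_patch
        return heads.view(*out.shape[:2], D_MODEL)


block = ToyBlock()

# 1) Extraction: run many example prompts (random tensors standing in for
#    encoded text+image shots) and average the per-head activations at the
#    last token into a compact task vector.
with torch.no_grad():
    example_prompts = [torch.randn(1, SEQ_LEN, D_MODEL) for _ in range(16)]
    per_head_acts = []
    for prompt in example_prompts:
        acts = block(prompt).view(1, SEQ_LEN, N_HEADS, HEAD_DIM)
        per_head_acts.append(acts[0, -1])                # (n_heads, head_dim)
    task_vector = torch.stack(per_head_acts).mean(dim=0)

# 2) Application: at inference the query is processed WITHOUT the example
#    shots in context; the compressed task vector is injected into the
#    attention heads instead, so no extra context length is used.
block.head_patch = task_vector
with torch.no_grad():
    query = torch.randn(1, SEQ_LEN, D_MODEL)             # the new test input
    patched_output = block(query)

print("task vector shape:", tuple(task_vector.shape))
print("patched output shape:", tuple(patched_output.shape))
```

In this toy version, the task vector costs only a small fixed tensor per patched head, regardless of how many example shots were averaged, which is the property that lets many-shot learning fit without growing the prompt.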

Why it matters?

This research matters because it makes large multimodal models more efficient and effective at learning from multimodal data. By letting these models draw on many examples at once, it can improve performance on complex tasks that involve both text and images, which could lead to better applications in areas like education, content creation, and interactive AI systems.

Abstract

The recent success of interleaved Large Multimodal Models (LMMs) in few-shot learning suggests that in-context learning (ICL) with many examples can be promising for learning new tasks. However, this many-shot multimodal ICL setting has one crucial problem: it is fundamentally limited by the model's context length set at pretraining. The problem is especially prominent in the multimodal domain, which processes both text and images, requiring additional tokens. This motivates the need for a multimodal method to compress many shots into fewer tokens without finetuning. In this work, we enable LMMs to perform multimodal, many-shot in-context learning by leveraging Multimodal Task Vectors (MTV)--compact implicit representations of in-context examples compressed in the model's attention heads. Specifically, we first demonstrate the existence of such MTV in LMMs and then leverage these extracted MTV to enable many-shot in-context learning for various vision-and-language tasks. Our experiments suggest that MTV can scale in performance with the number of compressed shots and generalize to similar out-of-domain tasks without additional context length for inference.