Task Vectors are Cross-Modal
Grace Luo, Trevor Darrell, Amir Bar
2024-10-30

Summary
This paper explores how vision-and-language models (VLMs) represent tasks across different kinds of input, such as text and images, and finds that conceptually similar tasks are mapped to similar internal representations regardless of how the task is specified.
What's the problem?
Understanding how VLMs process and represent tasks is important for improving their performance. However, existing analyses do not clearly explain how these models handle tasks specified in different ways, through examples or instructions and in text or images, or whether task knowledge derived from one modality transfers to another. This lack of understanding limits how effectively VLMs can be analyzed and applied in real-world settings.
What's the solution?
The authors investigate the internal workings of VLMs and find that tokens pass through three main phases when the model produces an answer: input, task, and answer. They discover that tasks specified through text or images are mapped to similar "task vectors," internal representations that encode what the model needs to do. These task vectors are general enough to be derived in one modality (e.g., text) and transferred to another (e.g., image), and the sketch below illustrates this idea. Additionally, the authors show that ensembling task vectors derived from examples and from instructions yields better overall task representations, helping the model perform effectively across modalities.
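To make the mechanism concrete, here is a minimal sketch of cross-modal task-vector patching under several assumptions: a generic decoder-style VLM wrapper `vlm` with a `tokenize` helper, a `forward` call that returns per-layer hidden states, an indexable `vlm.layers` list, and a hand-picked layer index. All of these names and the choice of layer are illustrative, not the authors' exact implementation.

```python
import torch

LAYER = 15  # hypothetical intermediate layer where the "task" phase is assumed to live

@torch.no_grad()
def derive_task_vector(vlm, exemplar_prompts, layer=LAYER):
    """Average the final-token hidden state of each text exemplar at the
    chosen layer; the mean activation serves as the task vector."""
    states = []
    for prompt in exemplar_prompts:
        inputs = vlm.tokenize(prompt)                      # assumed helper
        out = vlm.forward(inputs, output_hidden_states=True)
        states.append(out.hidden_states[layer][0, -1])     # last-token state
    return torch.stack(states).mean(dim=0)

def patch_task_vector(vlm, task_vector, layer=LAYER):
    """Register a hook that overwrites the last position's activation at
    `layer` with the task vector during a zero-shot (e.g., image) query."""
    def hook(_module, _inp, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden[:, -1, :] = task_vector                     # patch final position
        return output
    return vlm.layers[layer].register_forward_hook(hook)

# Usage sketch: derive the vector from text exemplars, then answer an
# image query with no in-context examples.
# tv = derive_task_vector(vlm, ["France -> Paris", "Japan -> Tokyo"])
# handle = patch_task_vector(vlm, tv)
# answer = vlm.generate(image_query)   # assumed generation helper
# handle.remove()
```

The design choice here is to treat the task vector as a single activation that can be swapped into the residual stream at one layer; the exact layer and the wrapper API would need to be adapted to the specific VLM used.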
Why it matters?
This research is significant because it sheds light on how VLMs can learn and adapt to various tasks using different types of input. By understanding these internal representations better, researchers can improve the design of VLMs, making them more flexible and effective in handling complex tasks in diverse fields such as education, healthcare, and content creation.
Abstract
We investigate the internal representations of vision-and-language models (VLMs) and how they encode task representations. We consider tasks specified through examples or instructions, using either text or image inputs. Surprisingly, we find that conceptually similar tasks are mapped to similar task vector representations, regardless of how they are specified. Our findings suggest that to output answers, tokens in VLMs undergo three distinct phases: input, task, and answer, a process which is consistent across different modalities and specifications. The task vectors we identify in VLMs are general enough to be derived in one modality (e.g., text) and transferred to another (e.g., image). Additionally, we find that ensembling exemplar- and instruction-based task vectors produces better task representations. Taken together, these insights shed light on the underlying mechanisms of VLMs, particularly their ability to represent tasks in a shared manner across different modalities and task specifications. Project page: https://task-vectors-are-cross-modal.github.io.
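A hedged sketch of the ensembling idea mentioned in the abstract: given one task vector derived from exemplars and another from an instruction, one natural combination is a (weighted) average. The mixing weight `alpha` is a hypothetical parameter for illustration, not a detail given in the abstract.

```python
import torch

def ensemble_task_vectors(tv_exemplar: torch.Tensor,
                          tv_instruction: torch.Tensor,
                          alpha: float = 0.5) -> torch.Tensor:
    """Combine exemplar- and instruction-derived task vectors by a
    weighted average; alpha is an assumed mixing weight."""
    return alpha * tv_exemplar + (1 - alpha) * tv_instruction
```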