
Do Vision and Language Models Share Concepts? A Vector Space Alignment Study

Jiaang Li, Yova Kementchedjhieva, Constanza Fierro, Anders Søgaard

2024-07-11


Summary

This paper investigates whether large language models (LMs) and vision models share similar concepts or representations. It examines how these models understand and connect information from language and images.

What's the problem?

Language models are often criticized for being unable to connect their representations of words and sentences to real-world referents, that is, for lacking a 'mental model' of the world. If that criticism were true, the representations LMs learn from text should be unrelated to the representations that vision models learn from images, which raises the question of how well the two kinds of models can work together.

What's the solution?

The researchers conducted experiments comparing four families of language models (BERT, GPT-2, OPT, and LLaMA-2) with three vision model architectures (ResNet, SegFormer, and MAE). They tested whether the representation spaces of LMs can be aligned with those of vision models, that is, whether a simple mapping can link a word's language representation to the visual representation of the same concept (see the sketch below). Their findings suggest that LM representations partially converge towards those of vision models, though the degree of alignment depends on factors such as dispersion (how varied the images for a concept are), polysemy (words with multiple meanings), and word frequency.
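To make the idea of vector space alignment concrete, the snippet below sketches one common way to test it: fit a linear (here, orthogonal Procrustes) map from LM concept embeddings to vision concept embeddings and measure how often a mapped word vector retrieves its own image vector. This is an illustration of the general technique under stated assumptions, not the paper's exact protocol; the function names and the random stand-in data are hypothetical.

```python
# Sketch: align an LM embedding space to a vision embedding space and score
# the alignment by nearest-neighbor retrieval. Assumes both matrices are
# row-aligned over the same concepts and share the same dimensionality.
import numpy as np

def procrustes_align(lm_emb: np.ndarray, vis_emb: np.ndarray) -> np.ndarray:
    """Orthogonal map W minimizing ||lm_emb @ W - vis_emb||_F."""
    u, _, vt = np.linalg.svd(lm_emb.T @ vis_emb)
    return u @ vt

def precision_at_1(lm_emb: np.ndarray, vis_emb: np.ndarray, W: np.ndarray) -> float:
    """Fraction of concepts whose mapped LM vector is nearest to its own image vector."""
    mapped = lm_emb @ W
    mapped /= np.linalg.norm(mapped, axis=1, keepdims=True)
    vis_n = vis_emb / np.linalg.norm(vis_emb, axis=1, keepdims=True)
    nearest = (mapped @ vis_n.T).argmax(axis=1)   # cosine nearest neighbor
    return float((nearest == np.arange(len(lm_emb))).mean())

# Usage with random stand-in data; real experiments would pool LM token
# representations and vision features for the same set of concepts.
rng = np.random.default_rng(0)
lm_emb = rng.normal(size=(500, 256))
vis_emb = rng.normal(size=(500, 256))
W = procrustes_align(lm_emb, vis_emb)
print(precision_at_1(lm_emb, vis_emb, W))
```

A higher retrieval precision than chance indicates that the two spaces share structure; chance level here would be 1/500 for random, unrelated embeddings.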

Why it matters?

Understanding whether LMs and vision models share concepts is crucial for improving how these models work together in tasks that involve both language and images. This research could lead to better multimodal AI systems that can interpret and respond to information more effectively, enhancing applications like visual question answering and image captioning.

Abstract

Large-scale pretrained language models (LMs) are said to ``lack the ability to connect utterances to the world'' (Bender and Koller, 2020), because they do not have ``mental models of the world'' (Mitchell and Krakauer, 2023). If so, one would expect LM representations to be unrelated to representations induced by vision models. We present an empirical evaluation across four families of LMs (BERT, GPT-2, OPT and LLaMA-2) and three vision model architectures (ResNet, SegFormer, and MAE). Our experiments show that LMs partially converge towards representations isomorphic to those of vision models, subject to dispersion, polysemy and frequency. This has important implications for both multi-modal processing and the LM understanding debate (Mitchell and Krakauer, 2023).