Analyzing The Language of Visual Tokens
David M. Chan, Rodolfo Corona, Joonyong Park, Cheol Jun Cho, Yutong Bai, Trevor Darrell
2024-11-08

Summary
This paper analyzes the "language" formed by visual tokens, the discrete pieces of images that vision models treat like words in text, to better understand how it relates to natural language.
What's the problem?
With the rise of transformer models that connect images and text, there is a need to understand how visual tokens behave. Researchers want to know whether these visual tokens follow patterns and rules similar to those of natural languages, but there has been little research on this question.
What's the solution?
The authors ran experiments comparing visual languages (the languages formed by image tokens) with natural languages. They found that visual languages follow some of the same statistical patterns, such as Zipfian distributions (where a few tokens are very common and most are rare), but lack the cohesive grammatical structure that natural languages have. They also found that visual tokens tend to represent parts of objects rather than whole objects or scenes, and that sequences of visual tokens do not form the kind of coherent, hierarchical structure that sentences do. The study highlights both the similarities and the differences between visual and natural languages, providing insights that could help improve computer vision models.
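The Zipfian claim can be illustrated with a short rank-frequency check. The sketch below is not the authors' exact analysis; it assumes you already have a flat list of discrete visual token IDs (for example, codebook indices produced by an image tokenizer), and the `token_ids` argument is a placeholder for that data.

```python
from collections import Counter

import numpy as np

def zipf_slope(token_ids):
    """Fit log(frequency) against log(rank) and return the slope.

    A slope near -1 is the classic Zipfian signature: a few tokens
    are very common and most tokens are rare.
    """
    # Sort token frequencies from most to least common.
    counts = np.array(sorted(Counter(token_ids).values(), reverse=True), dtype=float)
    ranks = np.arange(1, len(counts) + 1)
    # Linear fit in log-log space; the slope summarizes the distribution.
    slope, _intercept = np.polyfit(np.log(ranks), np.log(counts), deg=1)
    return slope

# Hypothetical usage: `visual_ids` would come from an image tokenizer's
# codebook indices, `text_ids` from a text tokenizer over a corpus.
# print(zipf_slope(visual_ids), zipf_slope(text_ids))
```

Comparing the fitted slopes (and how well the line fits) for visual and text tokens is one simple way to see whether both follow a similar rank-frequency law.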
Why it matters?
Understanding the language of visual tokens is important because it can help researchers design better models for interpreting images and videos. By learning how visual information is structured, we can enhance AI systems that need to process both text and images, making them more effective in applications like image recognition, video analysis, and human-computer interaction.
Abstract
With the introduction of transformer-based models for vision and language tasks, such as LLaVA and Chameleon, there has been renewed interest in the discrete tokenized representation of images. These models often treat image patches as discrete tokens, analogous to words in natural language, learning joint alignments between visual and human languages. However, little is known about the statistical behavior of these visual languages - whether they follow similar frequency distributions, grammatical structures, or topologies as natural languages. In this paper, we take a natural-language-centric approach to analyzing discrete visual languages and uncover striking similarities and fundamental differences. We demonstrate that, although visual languages adhere to Zipfian distributions, higher token innovation drives greater entropy and lower compression, with tokens predominantly representing object parts, indicating intermediate granularity. We also show that visual languages lack cohesive grammatical structures, leading to higher perplexity and weaker hierarchical organization compared to natural languages. Finally, we demonstrate that, while vision models align more closely with natural languages than other models, this alignment remains significantly weaker than the cohesion found within natural languages. Through these experiments, we demonstrate how understanding the statistical properties of discrete visual languages can inform the design of more effective computer vision models.
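To make the entropy and perplexity comparisons in the abstract concrete, a simple n-gram baseline over token sequences is enough to show the kind of measurement involved. The sketch below assumes hypothetical lists of token-ID sequences (`visual_seqs`, `text_seqs`) and is an illustration of the general technique, not the paper's exact protocol.

```python
import math
from collections import Counter

def unigram_entropy(sequences):
    """Shannon entropy (bits per token) of the unigram distribution."""
    counts = Counter(tok for seq in sequences for tok in seq)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def bigram_perplexity(sequences, vocab_size):
    """Perplexity of an add-one-smoothed bigram model scored on its own data."""
    unigrams, bigrams = Counter(), Counter()
    for seq in sequences:
        unigrams.update(seq[:-1])
        bigrams.update(zip(seq[:-1], seq[1:]))
    log_prob, n_tokens = 0.0, 0
    for seq in sequences:
        for prev, cur in zip(seq[:-1], seq[1:]):
            # Add-one smoothing so unseen bigrams get nonzero probability.
            p = (bigrams[(prev, cur)] + 1) / (unigrams[prev] + vocab_size)
            log_prob += math.log2(p)
            n_tokens += 1
    return 2 ** (-log_prob / n_tokens)

# Hypothetical usage: compare visual-token sequences against text-token
# sequences; higher perplexity suggests weaker sequential structure.
# print(bigram_perplexity(visual_seqs, 8192), bigram_perplexity(text_seqs, 32000))
```

Higher entropy and higher n-gram perplexity for visual tokens, relative to text tokens, would point in the same direction as the paper's finding that visual languages are less compressible and less grammatically predictable.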