Relational Visual Similarity
Thao Nguyen, Sicheng Mo, Krishna Kumar Singh, Yilin Wang, Jing Shi, Nicholas Kolkin, Eli Shechtman, Yong Jae Lee, Yuheng Li
2025-12-09
Summary
This paper explores how computers 'see' similarity between images, arguing that current methods miss a key part of how humans do it. We notice not only whether things *look* alike but also whether the *relationships* between their parts are alike, and this research tries to get computers to do the same.
What's the problem?
Current computer vision systems are really good at finding images with similar colors, shapes, and textures. However, they struggle to recognize when images are similar because of how their parts relate to each other. For example, a computer might not see the connection between the Earth and a peach – both have a layered structure with an outer layer, a middle part, and a core – even though a human easily would. This is a big limitation because understanding these relationships is crucial for human-level intelligence.
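To make the contrast concrete, here is a minimal sketch of the kind of attribute-based similarity score current systems produce, using CLIP image embeddings and cosine similarity. The model name and file paths are illustrative; a score like this stays low for a pair such as the Earth and a peach because the two look nothing alike on the surface.

```python
# A minimal sketch of attribute-based image similarity with CLIP embeddings --
# the kind of metric the paper argues is insufficient for relational similarity.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def attribute_similarity(path_a: str, path_b: str) -> float:
    images = [Image.open(path_a), Image.open(path_b)]
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = model.get_image_features(**inputs)      # one embedding per image
    feats = feats / feats.norm(dim=-1, keepdim=True)    # unit-normalize
    return float(feats[0] @ feats[1])                   # cosine similarity

# Earth vs. peach: visually dissimilar, so this score is low even though humans
# see a clear relational correspondence (crust/mantle/core ~ skin/flesh/pit).
print(attribute_similarity("earth.jpg", "peach.jpg"))
```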
What's the solution?
The researchers created a new dataset of over 114,000 image-caption pairs in which each caption describes not *what* is in the image but the *relationships* between the things in it. They then fine-tuned an existing vision-language model on this dataset, teaching it to focus on relational similarities. This lets the model judge whether two images share the same underlying structure, even if they look very different.
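The paper's exact training and scoring recipe is not reproduced here, but the overall idea can be sketched roughly as follows: describe each image by its relational structure (the kind of anonymized caption a fine-tuned vision-language model would produce) and then compare the descriptions. The sentence encoder, the example captions, and the `relational_similarity` helper below are assumptions for illustration, not the authors' code.

```python
# A rough illustration (not the authors' implementation): compare two images via
# the anonymized relational descriptions a fine-tuned vision-language model
# would produce for them, then measure how closely those descriptions agree.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")  # off-the-shelf text encoder

def relational_similarity(caption_a: str, caption_b: str) -> float:
    """Cosine similarity between two anonymized relational captions."""
    emb = encoder.encode([caption_a, caption_b], convert_to_tensor=True)
    return float(util.cos_sim(emb[0], emb[1]))

# Hypothetical anonymized captions for the Earth and a halved peach --
# no object names, only relational structure.
earth = "a sphere made of three nested layers: a thin shell, a thick middle layer, and a dense core"
peach = "a round object with a thin outer covering, a soft thick middle, and a hard center"
print(relational_similarity(earth, peach))  # high, despite very different appearances
```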
Why it matters?
This work is important because it highlights a significant gap in how computers 'see' the world. By enabling computers to understand relational similarity, we can move closer to building AI systems that can reason and understand images more like humans do, which has potential applications in areas like robotics, image search, and even medical imaging.
Abstract
Humans do not just see attribute similarity -- we also see relational similarity. An apple is like a peach because both are reddish fruit, but the Earth is also like a peach: its crust, mantle, and core correspond to the peach's skin, flesh, and pit. This ability to perceive and recognize relational similarity is argued by cognitive scientists to be what distinguishes humans from other species. Yet, all widely used visual similarity metrics today (e.g., LPIPS, CLIP, DINO) focus solely on perceptual attribute similarity and fail to capture the rich, often surprising relational similarities that humans perceive. How can we go beyond the visible content of an image to capture its relational properties? How can we bring images with the same relational logic closer together in representation space? To answer these questions, we first formulate relational image similarity as a measurable problem: two images are relationally similar when their internal relations or functions among visual elements correspond, even if their visual attributes differ. We then curate a 114k image-caption dataset in which the captions are anonymized -- describing the underlying relational logic of the scene rather than its surface content. Using this dataset, we fine-tune a vision-language model to measure the relational similarity between images. This model serves as the first step toward connecting images by their underlying relational structure rather than their visible appearance. Our study shows that while relational similarity has many real-world applications, existing image similarity models fail to capture it -- revealing a critical gap in visual computing.
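For intuition, here is a hypothetical example of what an "anonymized" caption might look like next to an ordinary surface caption, following the abstract's Earth/peach analogy; the field names and schema are assumptions, not the released dataset format.

```python
# Hypothetical dataset entry contrasting a surface caption with an anonymized
# relational caption; the schema and field names are illustrative assumptions.
example_entry = {
    "image": "earth_cross_section.jpg",
    "surface_caption": "a cutaway diagram of the Earth showing its crust, mantle, and core",
    "relational_caption": (
        "a roughly spherical object composed of three concentric layers: "
        "a thin outer shell, a thick intermediate layer, and a compact inner core"
    ),
}
```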