StableSemantics: A Synthetic Language-Vision Dataset of Semantic Representations in Naturalistic Images
Rushikesh Zawar, Shaurya Dewan, Andrew F. Luo, Margaret M. Henderson, Michael J. Tarr, Leila Wehbe
2024-06-21

Summary
This paper introduces StableSemantics, a new synthetic dataset designed to improve how computer vision systems understand and identify objects in images by capturing what objects mean and how they relate to one another in naturalistic scenes.
What's the problem?
Understanding visual scenes is challenging for computer vision because objects that have similar meanings can look very different. For example, a 'container' could be a box, a jar, or a bag, and these objects can vary greatly in shape and appearance. This variability makes it hard for computer vision models to accurately recognize and categorize these objects. Additionally, existing models often struggle with different lighting conditions and complex scenes where multiple objects interact.
What's the solution?
The researchers created the StableSemantics dataset, which pairs 224,000 human-curated prompts with over 2 million synthetic images generated from them. Each prompt is rendered ten times to capture a wide range of visual variation, and each image comes with a processed natural language caption and cross-attention maps that link each noun phrase in the prompt to the image regions it produces. Together, these annotations give models richer supervision for learning the semantics of scenes, improving their ability to recognize and categorize objects accurately.
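To make that structure concrete, the sketch below models what a single StableSemantics record might look like. The field names and types are illustrative assumptions for exposition, not the dataset's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List

import numpy as np


@dataclass
class StableSemanticsRecord:
    """Hypothetical layout of one StableSemantics entry (field names are illustrative)."""
    prompt: str                       # human-curated text prompt
    caption: str                      # processed natural-language caption
    images: List[np.ndarray]          # 10 synthetic generations for this prompt (H x W x 3)
    # Per generation: a map from each noun chunk in the prompt to its
    # cross-attention map over the image (a 2D array).
    attention_maps: List[Dict[str, np.ndarray]] = field(default_factory=list)

    def noun_chunks(self) -> List[str]:
        """Noun chunks annotated for the first generation (all generations share one prompt)."""
        return list(self.attention_maps[0].keys()) if self.attention_maps else []
```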
Why does it matter?
This research is important because it provides a valuable resource for advancing the field of computer vision. By focusing on the meanings behind visual representations, StableSemantics aims to enhance how AI systems interpret images, making them more effective in real-world applications like autonomous vehicles, security systems, and content creation. Ultimately, this dataset could lead to smarter AI that understands visuals more like humans do.
Abstract
Understanding the semantics of visual scenes is a fundamental challenge in Computer Vision. A key aspect of this challenge is that objects sharing similar semantic meanings or functions can exhibit striking visual differences, making accurate identification and categorization difficult. Recent advancements in text-to-image frameworks have led to models that implicitly capture natural scene statistics. These frameworks account for the visual variability of objects, as well as complex object co-occurrences and sources of noise such as diverse lighting conditions. By leveraging large-scale datasets and cross-attention conditioning, these models generate detailed and contextually rich scene representations. This capability opens new avenues for improving object recognition and scene understanding in varied and challenging environments. Our work presents StableSemantics, a dataset comprising 224 thousand human-curated prompts, processed natural language captions, over 2 million synthetic images, and 10 million attention maps corresponding to individual noun chunks. We explicitly leverage human-generated prompts that correspond to visually interesting stable diffusion generations, provide 10 generations per phrase, and extract cross-attention maps for each image. We explore the semantic distribution of generated images, examine the distribution of objects within images, and benchmark captioning and open vocabulary segmentation methods on our data. To the best of our knowledge, we are the first to release a diffusion dataset with semantic attributions. We expect our proposed dataset to catalyze advances in visual semantic understanding and provide a foundation for developing more sophisticated and effective visual models. Website: https://stablesemantics.github.io/StableSemantics
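Each cross-attention map in the dataset is tied to a noun chunk from its prompt. The paper does not specify the tooling used for this step, but a minimal sketch of noun-chunk extraction, here assumed to use spaCy, could look like the following:

```python
# Minimal sketch of extracting noun chunks from a prompt with spaCy.
# spaCy is an assumption here; the paper does not state which parser was used.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")


def extract_noun_chunks(prompt: str) -> list[str]:
    """Return the noun chunks of a prompt, e.g. the phrases attention maps are keyed on."""
    doc = nlp(prompt)
    return [chunk.text for chunk in doc.noun_chunks]


if __name__ == "__main__":
    prompt = "a red fox sleeping under an old oak tree at sunset"
    print(extract_noun_chunks(prompt))
    # e.g. ['a red fox', 'an old oak tree', 'sunset'] (exact output depends on the parser)
```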