LAION-SG: An Enhanced Large-Scale Dataset for Training Complex Image-Text Models with Structural Annotations
Zejian Li, Chenye Meng, Yize Li, Ling Yang, Shengyuan Zhang, Jiarui Ma, Jiayi Li, Guang Yang, Changyuan Yang, Zhiyuan Yang, Jinxiong Chang, Lingyun Sun
2024-12-12

Summary
This paper introduces LAION-SG, a new large-scale dataset designed to improve the ability of text-to-image models to generate complex images involving multiple objects and their relationships.
What's the problem?
Current text-to-image (T2I) models struggle to create detailed images when the prompts involve multiple objects and complex interactions. This is largely because existing datasets lack precise annotations describing how objects relate to each other, which makes it hard for models to understand and generate these intricate scenes accurately.
What's the solution?
To solve this problem, the authors created LAION-SG, a dataset with high-quality structural annotations in the form of scene graphs. These annotations provide detailed information about the attributes of objects in a scene and the relationships between them. The authors also trained a new foundation model, SDXL-SG, on this dataset; it incorporates the structural information directly into the image generation process. In addition, they introduced CompSG-Bench, a benchmark for evaluating how well models generate complex compositional images.
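For concreteness, a scene graph pairs a set of attributed objects with (subject, predicate, object) relation triples. The exact LAION-SG annotation schema is not specified in this summary, so the Python sketch below uses hypothetical field names purely to illustrate what such an annotation captures.

```python
# Illustrative scene-graph annotation for one image; the field names are
# hypothetical and do not reflect the exact LAION-SG schema.
scene_graph = {
    "objects": [
        {"id": 0, "name": "dog", "attributes": ["brown", "small"]},
        {"id": 1, "name": "sofa", "attributes": ["red"]},
        {"id": 2, "name": "window", "attributes": []},
    ],
    # Relations are (subject_id, predicate, object_id) triples.
    "relations": [
        (0, "lying on", 1),
        (1, "next to", 2),
    ],
}

# Flatten the graph into readable triples for inspection.
for subj_id, pred, obj_id in scene_graph["relations"]:
    subj = scene_graph["objects"][subj_id]["name"]
    obj = scene_graph["objects"][obj_id]["name"]
    print(f"{subj} --{pred}--> {obj}")
```

Representing relations as explicit triples, rather than leaving them implicit in a free-text prompt, gives a generator an unambiguous target for multi-object scenes, which is the core idea behind the dataset.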
Why it matters?
This research is important because it sets a new standard for training models that generate images from text prompts. By providing a dataset that captures the complexity of real-world scenes, LAION-SG helps improve the accuracy and quality of generated images, making it useful for applications in art, design, and virtual environments.
Abstract
Recent advances in text-to-image (T2I) generation have shown remarkable success in producing high-quality images from text. However, existing T2I models show degraded performance in compositional image generation involving multiple objects and intricate relationships. We attribute this problem to limitations in existing image-text pair datasets, which provide only prompts and lack precise inter-object relationship annotations. To address this problem, we construct LAION-SG, a large-scale dataset with high-quality structural annotations of scene graphs (SG), which precisely describe attributes and relationships of multiple objects, effectively representing the semantic structure in complex scenes. Based on LAION-SG, we train a new foundation model, SDXL-SG, to incorporate structural annotation information into the generation process. Extensive experiments show that advanced models trained on our LAION-SG achieve significant performance improvements in complex scene generation over models trained on existing datasets. We also introduce CompSG-Bench, a benchmark that evaluates models on compositional image generation, establishing a new standard for this domain.
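The summary does not detail how SDXL-SG consumes the scene graph, and the mechanism below is only one plausible design: embed each relation triple and pass the resulting tokens to the diffusion model as conditioning alongside the usual text embedding. All module names, dimensions, and the PyTorch framing are assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class SceneGraphEncoder(nn.Module):
    """Hypothetical encoder mapping (subject, predicate, object) id triples
    to conditioning tokens; not the actual SDXL-SG architecture."""

    def __init__(self, vocab_size: int, dim: int = 256):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        # Mix the three embeddings of each triple into a single token.
        self.mix = nn.Linear(3 * dim, dim)

    def forward(self, triples: torch.Tensor) -> torch.Tensor:
        # triples: (batch, num_triples, 3) integer ids
        e = self.embed(triples)                  # (B, T, 3, dim)
        return self.mix(e.flatten(start_dim=2))  # (B, T, dim)

# Toy usage: two relation triples for one image, ids from a toy vocabulary.
enc = SceneGraphEncoder(vocab_size=100)
triples = torch.tensor([[[3, 17, 5], [5, 21, 9]]])  # shape (1, 2, 3)
cond_tokens = enc(triples)
print(cond_tokens.shape)  # torch.Size([1, 2, 256])
```

In practice, tokens like these could be concatenated with the text-encoder output before cross-attention, one standard way to add an extra conditioning stream to a latent diffusion model.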