Generating Multi-Image Synthetic Data for Text-to-Image Customization
Nupur Kumari, Xi Yin, Jun-Yan Zhu, Ishan Misra, Samaneh Azadi
2025-02-05
Summary
This paper introduces SynCD, a method that improves how well text-to-image models can customize and personalize images. It trains on a synthetic dataset showing the same objects across different settings, enabling the models to reproduce a user's concept faithfully in new scenes described by text.
What's the problem?
Existing methods for text-to-image customization either require expensive, time-consuming per-concept optimization at test time or train encoders on single-image datasets, which limits the quality and flexibility of the generated images. Both kinds of approaches struggle to render an object faithfully across varied environments and poses.
What's the solution?
The researchers created SynCD, a synthetic dataset containing multiple images of the same object under different lighting, backgrounds, and poses, built using existing text-to-image models and 3D object assets. They also developed a new encoder architecture that captures fine-grained details from the input images, along with an inference technique that normalizes the guidance signals to reduce overexposure during image generation. Together, these improvements let the model generate high-quality, customized images without per-concept fine-tuning.
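The overexposure fix can be illustrated with a small sketch. With large guidance weights, classifier-free guidance can inflate the magnitude of the predicted noise, which shows up as washed-out, overexposed images. Below is a minimal NumPy sketch of dual (text + image) guidance with normalized guidance vectors; the function name, the default weights, and the specific normalization (rescaling each guidance direction to the norm of the unconditional prediction) are illustrative assumptions, not the paper's exact formula.

```python
import numpy as np

def guided_noise(eps_uncond, eps_text, eps_image, w_text=7.5, w_image=3.0):
    """Dual classifier-free guidance with normalized guidance vectors.

    Illustrative sketch: each guidance direction is rescaled to the norm
    of the unconditional prediction before the weights are applied, so
    high guidance scales change the update direction but not its magnitude.
    """
    g_text = eps_text - eps_uncond
    g_image = eps_image - eps_uncond
    ref = np.linalg.norm(eps_uncond)
    # Normalize each guidance vector; the epsilon guards against division by zero.
    g_text = g_text * ref / (np.linalg.norm(g_text) + 1e-8)
    g_image = g_image * ref / (np.linalg.norm(g_image) + 1e-8)
    return eps_uncond + w_text * g_text + w_image * g_image
```

Rescaling preserves the direction contributed by the text and image conditions while capping their magnitude, which is one common way to tame overexposure at high guidance scales.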
Why it matters?
This research is important because it makes text-to-image customization more accurate, faster, and easier to use. By enabling better personalization and higher-quality outputs, it opens up new possibilities for creative applications in fields like design, marketing, and entertainment.
Abstract
Customization of text-to-image models enables users to insert custom concepts and generate the concepts in unseen settings. Existing methods either rely on costly test-time optimization or train encoders on single-image training datasets without multi-image supervision, leading to worse image quality. We propose a simple approach that addresses both limitations. We first leverage existing text-to-image models and 3D datasets to create a high-quality Synthetic Customization Dataset (SynCD) consisting of multiple images of the same object in different lighting, backgrounds, and poses. We then propose a new encoder architecture based on shared attention mechanisms that better incorporates fine-grained visual details from input images. Finally, we propose a new inference technique that mitigates overexposure issues by normalizing the text and image guidance vectors. Through extensive experiments, we show that our model, trained on the synthetic dataset with the proposed encoder and inference algorithm, outperforms existing tuning-free methods on standard customization benchmarks.
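To make the shared-attention idea concrete, here is a minimal single-head sketch in NumPy: the target image's queries attend jointly over its own keys/values and those of the reference images, so fine-grained visual details can flow from the references into the generated image. All names and shapes are illustrative; the paper's encoder operates inside a diffusion model's attention layers and will differ in detail.

```python
import numpy as np

def shared_attention(q_tgt, k_tgt, v_tgt, k_ref, v_ref):
    """Single-head attention where target queries also see reference tokens.

    q_tgt: (n_tgt, d) queries from the image being generated.
    k_tgt, v_tgt: (n_tgt, d) keys/values from the same image.
    k_ref, v_ref: (n_ref, d) keys/values from the reference image(s).
    """
    # Concatenate target and reference tokens into one attention context.
    k = np.concatenate([k_tgt, k_ref], axis=0)
    v = np.concatenate([v_tgt, v_ref], axis=0)
    d = q_tgt.shape[-1]
    scores = q_tgt @ k.T / np.sqrt(d)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    w = np.exp(scores)
    w /= w.sum(axis=-1, keepdims=True)            # softmax over all tokens
    return w @ v                                  # (n_tgt, d)
```

Because the softmax runs over the concatenated token set, each generated-image token can directly copy appearance information from the reference tokens, which is what lets the encoder transfer fine-grained details without per-concept fine-tuning.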