CSGO: Content-Style Composition in Text-to-Image Generation
Peng Xing, Haofan Wang, Yanpeng Sun, Qixun Wang, Xu Bai, Hao Ai, Renyuan Huang, Zechao Li
2024-08-30

Summary
This paper discusses CSGO, a new method for image style transfer that explicitly separates content and style, trained end-to-end on a large dataset of stylized image triplets.
What's the problem?
Creating images in a specific style is challenging because existing methods rely on limited data and often produce inconsistent results. Better techniques are needed to transfer the style of one image onto the content of another.
What's the solution?
The authors developed a data construction pipeline that generates and automatically cleans image triplets: sets of three images consisting of a content image, a style reference, and the content rendered in that style. Using this pipeline, they built IMAGStyle, a dataset of 210,000 such triplets. Trained on this dataset, CSGO explicitly decouples content and style features, enabling high-quality style transfer driven by either images or text prompts.
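To make the data layout and the idea of "independent feature injection" concrete, here is a minimal sketch. All names (`StyleTriplet`, `inject_features`, the per-branch scales) are hypothetical illustrations, not the authors' code; the real model injects content and style features into a diffusion backbone through separate branches.

```python
from dataclasses import dataclass

@dataclass
class StyleTriplet:
    """One IMAGStyle-style record: the same content in two renderings.
    Fields are illustrative; the actual dataset schema may differ."""
    content_image: str   # path to the content image
    style_image: str     # path to the style reference image
    stylized_image: str  # path to the target: content rendered in the style

def inject_features(base, content_feats, style_feats,
                    content_scale=1.0, style_scale=1.0):
    """Toy stand-in for independent feature injection: content and style
    contributions are added to the base features with separate scales,
    so each branch can be strengthened or disabled on its own."""
    return [b + content_scale * c + style_scale * s
            for b, c, s in zip(base, content_feats, style_feats)]

# Disabling one branch leaves the other untouched:
content_only = inject_features([0.0, 0.0], [1.0, 2.0], [5.0, 5.0],
                               style_scale=0.0)   # -> [1.0, 2.0]
```

The point of the separate scales is that decoupled injection gives independent control knobs: turning `style_scale` down preserves content while weakening stylization, which is exactly the kind of control entangled single-branch conditioning cannot offer.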
Why it matters?
This research is important because it provides a large-scale resource for studying style transfer in images, which can enhance various applications like digital art, graphic design, and even video game graphics. By improving how styles are transferred, it opens up new creative possibilities for artists and designers.
Abstract
The diffusion model has shown exceptional capabilities in controlled image generation, which has further fueled interest in image style transfer. Existing works mainly focus on training-free methods (e.g., image inversion) due to the scarcity of specific data. In this study, we present a data construction pipeline for content-style-stylized image triplets that generates and automatically cleanses stylized data triplets. Based on this pipeline, we construct the IMAGStyle dataset, the first large-scale style transfer dataset, containing 210k image triplets, available for the community to explore and research. Equipped with IMAGStyle, we propose CSGO, a style transfer model based on end-to-end training, which explicitly decouples content and style features through independent feature injection. The unified CSGO implements image-driven style transfer, text-driven stylized synthesis, and text editing-driven stylized synthesis. Extensive experiments demonstrate the effectiveness of our approach in enhancing style control capabilities in image generation. Additional visualizations and the source code are available on the project page: https://csgo-gen.github.io/.