
Controllable Human Image Generation with Personalized Multi-Garments

Yisol Choi, Sangkyung Kwak, Sihyun Yu, Hyungwon Choi, Jinwoo Shin

2024-11-27


Summary

This paper presents BootComp, a framework built on text-to-image diffusion models that generates realistic images of people wearing multiple reference garments, giving users fine-grained control over which garments appear in the generated image.

What's the problem?

Creating high-quality images of people in various outfits is difficult because it requires a large dataset that pairs each person with separate photos of every garment they are wearing. Gathering such paired data manually is time-consuming and impractical, which makes it hard to train models that generate these images accurately.

What's the solution?

To overcome this problem, the authors built a data generation pipeline that constructs a large synthetic dataset pairing human images with multiple garment images. They designed a model that extracts garment images directly from existing photos of people, and added a filtering strategy that keeps only extracted garments that are perceptually similar to the garments visible in the original photos (see the sketch below). Using this synthetic dataset, they trained a diffusion model with two parallel denoising paths that take the garment images as conditions, generating detailed human images while preserving the fine-grained details of each specified garment.
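To make the filtering step concrete, here is a minimal sketch of how a perceptual-similarity filter over (garment-crop, extracted-garment) pairs could look. The paper only states that filtering is based on perceptual similarity between the garment shown in the human photo and the extracted garment; the LPIPS metric, the threshold value, and the file layout below are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the quality-filtering step: compare each extracted garment
# image against the garment region cropped from the original human photo using
# a perceptual distance, and keep only sufficiently similar pairs.
# The metric (LPIPS) and threshold are assumptions, not taken from the paper.
import torch
import lpips
from PIL import Image
from torchvision import transforms

to_tensor = transforms.Compose([
    transforms.Resize((256, 256)),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.5] * 3, std=[0.5] * 3),  # scale to [-1, 1], as LPIPS expects
])

perceptual = lpips.LPIPS(net="vgg").eval()  # LPIPS distance: lower = more similar


@torch.no_grad()
def keep_pair(garment_crop_path: str, extracted_garment_path: str,
              max_distance: float = 0.35) -> bool:
    """Return True if the extracted garment is perceptually close to the
    garment visible in the human photo (hypothetical threshold)."""
    a = to_tensor(Image.open(garment_crop_path).convert("RGB")).unsqueeze(0)
    b = to_tensor(Image.open(extracted_garment_path).convert("RGB")).unsqueeze(0)
    distance = perceptual(a, b).item()
    return distance <= max_distance


# Example: filter a list of (human_crop, extracted_garment) candidate pairs.
candidates = [("crops/person1_top.png", "extracted/person1_top.png")]
dataset = [pair for pair in candidates if keep_pair(*pair)]
```

Only the pairs that pass this check would be kept for training, which is how the pipeline bootstraps a large, clean synthetic dataset without manual collection.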

Why it matters?

This research is important because it enhances the ability to create customizable images for fashion and e-commerce applications. By enabling more accurate and controllable image generation, BootComp can help businesses showcase their products more effectively and allow customers to visualize how different garments look on various body types.

Abstract

We present BootComp, a novel framework based on text-to-image diffusion models for controllable human image generation with multiple reference garments. Here, the main bottleneck is data acquisition for training: collecting a large-scale dataset of high-quality reference garment images per human subject is quite challenging, i.e., ideally, one needs to manually gather every single garment photograph worn by each human. To address this, we propose a data generation pipeline to construct a large synthetic dataset, consisting of human and multiple-garment pairs, by introducing a model to extract any reference garment images from each human image. To ensure data quality, we also propose a filtering strategy to remove undesirable generated data based on measuring perceptual similarities between the garment presented in the human image and the extracted garment. Finally, by utilizing the constructed synthetic dataset, we train a diffusion model having two parallel denoising paths that use multiple garment images as conditions to generate human images while preserving their fine-grained details. We further show the wide applicability of our framework by adapting it to different types of reference-based generation in the fashion domain, including virtual try-on, and controllable human image generation with other conditions, e.g., pose and face.
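The abstract's "two parallel denoising paths" can be read as a garment branch whose features are injected into the main denoising branch through attention. The toy sketch below illustrates only that reading; the module names, dimensions, and attention-based injection are assumptions for illustration, not the authors' architecture.

```python
# Toy sketch of two parallel paths: a garment path encodes the reference
# garment images, and the human (denoising) path attends to those features
# at each block. Names and sizes are illustrative assumptions.
import torch
import torch.nn as nn


class GarmentCrossAttention(nn.Module):
    """Main-path tokens attend to tokens of the concatenated garment images."""
    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, human_tokens, garment_tokens):
        out, _ = self.attn(human_tokens, garment_tokens, garment_tokens)
        return human_tokens + out  # residual injection of garment detail


class TwoPathBlock(nn.Module):
    """One block: the garment path runs in parallel with the denoising path
    and passes its features into it via cross-attention."""
    def __init__(self, dim: int):
        super().__init__()
        self.human_mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
        self.garment_mlp = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU())
        self.inject = GarmentCrossAttention(dim)

    def forward(self, human_tokens, garment_tokens):
        garment_tokens = self.garment_mlp(garment_tokens)
        human_tokens = self.human_mlp(human_tokens)
        return self.inject(human_tokens, garment_tokens), garment_tokens


# Toy usage: two reference garments, each encoded as 64 tokens of width 320.
block = TwoPathBlock(dim=320)
human = torch.randn(1, 1024, 320)        # noisy human-image latents as tokens
garments = torch.randn(1, 2 * 64, 320)   # multiple garment images, tokens concatenated
human, garments = block(human, garments)
```

Because the garment tokens from all reference garments are simply concatenated before the attention step, the same mechanism extends from one garment to many, which is what makes multi-garment conditioning possible in this kind of design.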