Boost Your Own Human Image Generation Model via Direct Preference Optimization with AI Feedback
Sanghyeon Na, Yonggyu Kim, Hyunjoon Lee
2025-04-03
Summary
This paper improves how text-to-image models generate realistic images of humans by training them on preferences derived from AI feedback rather than from human annotation.
What's the problem?
Realistic AI-generated images of humans are hard to produce because a single image has to get the pose and anatomy right and also match the text description.
What's the solution?
The researchers developed a way to build a Direct Preference Optimization (DPO) training dataset from AI feedback, which avoids the cost of collecting human feedback. They also modified the DPO loss function to reduce unwanted artifacts and improve image fidelity.
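To make the idea of an AI-labeled preference dataset concrete, here is a minimal sketch of one way such pairs could be assembled. It is an illustration under stated assumptions, not the authors' pipeline: the names `build_preference_pairs`, `generate`, and `score` are hypothetical, and the rule of pairing the highest-scoring candidate with the lowest-scoring one is an assumed selection strategy. The summary does not say which scoring model or selection rule the paper uses.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class PreferencePair:
    prompt: str
    preferred: str  # identifier of the higher-scoring image
    rejected: str   # identifier of the lower-scoring image


def build_preference_pairs(
    prompts: Sequence[str],
    generate: Callable[[str, int], List[str]],  # hypothetical image generator
    score: Callable[[str, str], float],         # hypothetical AI preference scorer
    num_candidates: int = 4,
) -> List[PreferencePair]:
    """Pair the best and worst candidate per prompt (assumed selection rule)."""
    pairs: List[PreferencePair] = []
    for prompt in prompts:
        candidates = generate(prompt, num_candidates)
        # Score each candidate once so the scorer is not called repeatedly.
        scored = [(score(prompt, image), image) for image in candidates]
        if len(scored) < 2:
            continue  # a pair needs at least two candidates
        best = max(scored, key=lambda item: item[0])
        worst = min(scored, key=lambda item: item[0])
        if best[0] == worst[0]:
            continue  # no preference signal when all scores tie
        pairs.append(PreferencePair(prompt, preferred=best[1], rejected=worst[1]))
    return pairs


# Toy stand-ins so the sketch runs without any model; both are placeholders.
def toy_generate(prompt: str, n: int) -> List[str]:
    return [f"{prompt}#{i}" for i in range(n)]


def toy_score(prompt: str, image: str) -> float:
    return float(int(image.rsplit("#", 1)[1]) % 3)


if __name__ == "__main__":
    for pair in build_preference_pairs(["a person waving"], toy_generate, toy_score):
        print(pair)
```

In a real setting, `generate` would wrap the image model being tuned and `score` would wrap a learned preference or text-image alignment model. Because no person labels the pairs, the dataset can be rebuilt cheaply whenever the prompts or the base model change.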
Why does it matter?
This work matters because it can lead to better AI models that generate more realistic and accurate images of humans, which could have applications in areas like virtual reality and content creation.
Abstract
The generation of high-quality human images through text-to-image (T2I) methods is a significant yet challenging task. Distinct from general image generation, human image synthesis must satisfy stringent criteria related to human pose, anatomy, and alignment with textual prompts, making it particularly difficult to achieve realistic results. Recent advancements in T2I generation based on diffusion models have shown promise, yet challenges remain in meeting human-specific preferences. In this paper, we introduce a novel approach tailored specifically for human image generation utilizing Direct Preference Optimization (DPO). Specifically, we introduce an efficient method for constructing a specialized DPO dataset for training human image generation models without the need for costly human feedback. We also propose a modified loss function that enhances the DPO training process by minimizing artifacts and improving image fidelity. Our method demonstrates its versatility and effectiveness in generating human images, including personalized text-to-image generation. Through comprehensive evaluations, we show that our approach significantly advances the state of human image generation, achieving superior results in terms of natural anatomies, poses, and text-image alignment.
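For orientation, the generic DPO objective that the abstract builds on can be written as below. This is the standard formulation for a preference triple, not the modified loss proposed in the paper, whose exact form is not given in this summary. Here $x$ is the text prompt, $y_w$ and $y_l$ are the preferred and rejected images, $\pi_\theta$ is the model being trained, $\pi_{\mathrm{ref}}$ is a frozen reference copy, $\beta$ controls how far the model may drift from the reference, and $\sigma$ is the logistic function.

```latex
\mathcal{L}_{\mathrm{DPO}}(\theta)
  = -\,\mathbb{E}_{(x,\,y_w,\,y_l)}
    \left[
      \log \sigma\!\left(
        \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
        - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
      \right)
    \right]
```

Minimizing this loss raises the likelihood of the preferred sample relative to the rejected one, measured against the reference model. For diffusion models the exact likelihoods are intractable, so adaptations of DPO typically substitute a bound based on denoising errors; whether and how this paper does so is not stated in the abstract.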