Learning Flow Fields in Attention for Controllable Person Image Generation

Zijian Zhou, Shikun Liu, Xiao Han, Haozhe Liu, Kam Woh Ng, Tian Xie, Yuren Cong, Hang Li, Mengmeng Xu, Juan-Manuel Pérez-Rúa, Aditya Patel, Tao Xiang, Miaojing Shi, Sen He

2024-12-12

Summary

This paper introduces Learning Flow Fields in Attention (Leffa), a method that improves how AI generates images of people from reference images, allowing better control over their appearance and pose.

What's the problem?

When AI models generate images of people from reference images, they often lose important details and distort the textures of the original image. This happens because the models do not attend closely enough to the specific regions of the reference image that matter for an accurate result, especially when changing how a person looks or poses.

What's the solution?

To solve this problem, the authors developed Leffa, which guides the model to attend to the correct regions of the reference image during training. This is done by adding a regularization loss that encourages each target query in the attention layer to focus on its corresponding reference region. As a result, Leffa significantly reduces distortion in fine details while maintaining high overall image quality. The method has been tested extensively and performs strongly on tasks like virtual try-on (changing clothing) and pose transfer (changing body positions).
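The paper does not include code here, but the core idea, a loss that pushes attention toward the correct reference regions, can be sketched. The following is a minimal illustration, not the authors' implementation: all names (`attention_flow_loss`, `key_coords`, `target_flow`) are hypothetical, and it assumes a ground-truth correspondence is available that says which reference location each target query should attend to.

```python
import numpy as np

def attention_flow_loss(attn, key_coords, target_flow):
    """Hypothetical sketch of an attention-map regularization loss.

    attn:        (Q, K) attention map (rows sum to 1), target queries
                 over reference keys.
    key_coords:  (K, 2) normalized (x, y) coordinates of each reference key.
    target_flow: (Q, 2) ground-truth reference coordinate each target
                 query should attend to.

    The attention-weighted expected reference coordinate (a soft-argmax
    over keys) is compared to the ground-truth correspondence; attention
    mass placed on the wrong regions increases the loss.
    """
    expected = attn @ key_coords  # (Q, 2) expected attended location
    return float(np.mean((expected - target_flow) ** 2))

# Example: attention perfectly aligned with the correspondence gives zero loss.
coords = np.array([[0.0, 0.0], [1.0, 1.0]])
perfect_attn = np.eye(2)           # each query attends to its own key
uniform_attn = np.full((2, 2), 0.5)  # attention spread over wrong regions

zero_loss = attention_flow_loss(perfect_attn, coords, coords)      # 0.0
positive_loss = attention_flow_loss(uniform_attn, coords, coords)  # > 0
```

A loss of this form is added on top of the usual diffusion training objective, so it shapes where the model attends without changing the architecture, which is consistent with the paper's claim that the loss is model-agnostic.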

Why it matters?

This research is important because it enhances the ability of AI to generate realistic images of people with precise control over their appearance and poses. By improving how models handle details, Leffa can be applied in various fields such as fashion, gaming, and film, making it easier to create high-quality visual content that matches user specifications.

Abstract

Controllable person image generation aims to generate a person image conditioned on reference images, allowing precise control over the person's appearance or pose. However, prior methods often distort fine-grained textural details from the reference image, despite achieving high overall image quality. We attribute these distortions to inadequate attention to corresponding regions in the reference image. To address this, we thereby propose learning flow fields in attention (Leffa), which explicitly guides the target query to attend to the correct reference key in the attention layer during training. Specifically, it is realized via a regularization loss on top of the attention map within a diffusion-based baseline. Our extensive experiments show that Leffa achieves state-of-the-art performance in controlling appearance (virtual try-on) and pose (pose transfer), significantly reducing fine-grained detail distortion while maintaining high image quality. Additionally, we show that our loss is model-agnostic and can be used to improve the performance of other diffusion models.