Dual Caption Preference Optimization for Diffusion Models

Amir Saeidi, Yiran Luo, Agneet Chatterjee, Shamanthak Hegde, Bimsara Pathiraja, Yezhou Yang, Chitta Baral

2025-02-11

Summary

This paper introduces a new method called Dual Caption Preference Optimization (DCPO) that improves how AI creates images from text descriptions. It helps the AI learn what people prefer in images so that the generated images better match what people want.

What's the problem?

Current AI systems that create images from text descriptions have two main issues. First, the examples used to teach them what counts as a good image versus a bad one often overlap, so the model gets mixed signals about which is which. Second, the text descriptions given to the AI can contain details that don't actually match the less preferred images, which misleads the model during training.

What's the solution?

The researchers created DCPO, which uses two different captions for each image pair: one written for the preferred image and one for the less preferred image. They also built a new dataset called Pick-Double Caption that provides these dual captions. To create the captions, they tried three strategies: writing new captions (captioning), slightly changing existing ones (perturbation), or a mix of the two (hybrid), as sketched in the example below. This helps the AI better understand what makes a good image and ignore unhelpful information in the text descriptions.
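
Below is a rough, hypothetical sketch of what one entry in a dual-caption preference dataset could look like and how the three caption strategies might be wired together. The helper names (generate_caption, perturb_caption) and the exact behavior of each strategy are illustrative assumptions, not the paper's actual pipeline.

```python
from dataclasses import dataclass

@dataclass
class DualCaptionExample:
    prompt: str             # original shared prompt
    preferred_image: str    # path to the preferred ("chosen") image
    rejected_image: str     # path to the less preferred ("rejected") image
    preferred_caption: str  # caption tailored to the preferred image
    rejected_caption: str   # caption tailored to the less preferred image

def generate_caption(image_path: str) -> str:
    # Placeholder: in practice, call an image captioning model here.
    return f"caption describing {image_path}"

def perturb_caption(caption: str) -> str:
    # Placeholder: in practice, lightly rewrite the caption (e.g., with an LLM)
    # so it fits the less preferred image less precisely.
    return caption + " (with slightly altered details)"

def make_example(prompt, chosen_img, rejected_img, method="captioning"):
    if method == "captioning":      # write a new caption for each image
        cap_w = generate_caption(chosen_img)
        cap_l = generate_caption(rejected_img)
    elif method == "perturbation":  # keep the prompt for the chosen image, perturb it for the rejected one
        cap_w = prompt
        cap_l = perturb_caption(prompt)
    else:                           # hybrid: caption the chosen image, then perturb that caption
        cap_w = generate_caption(chosen_img)
        cap_l = perturb_caption(cap_w)
    return DualCaptionExample(prompt, chosen_img, rejected_img, cap_w, cap_l)

# Example usage with hypothetical file paths:
example = make_example("a red bicycle leaning on a fence", "chosen.png", "rejected.png", method="hybrid")
```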

Why it matters?

This matters because it makes AI-generated images look better and more closely match what people ask for. It could lead to better AI art tools, more realistic computer-generated graphics for games or movies, and improved visual content for things like advertising or education. By making AI-generated images more in line with human preferences, it could also make these tools more useful and appealing to a wider range of people.

Abstract

Recent advancements in human preference optimization, originally developed for Large Language Models (LLMs), have shown significant potential in improving text-to-image diffusion models. These methods aim to learn the distribution of preferred samples while distinguishing them from less preferred ones. However, existing preference datasets often exhibit overlap between these distributions, leading to a conflict distribution. Additionally, we identified that input prompts contain irrelevant information for less preferred images, limiting the denoising network's ability to accurately predict noise in preference optimization methods, known as the irrelevant prompt issue. To address these challenges, we propose Dual Caption Preference Optimization (DCPO), a novel approach that utilizes two distinct captions to mitigate irrelevant prompts. To tackle conflict distribution, we introduce the Pick-Double Caption dataset, a modified version of Pick-a-Pic v2 with separate captions for preferred and less preferred images. We further propose three different strategies for generating distinct captions: captioning, perturbation, and hybrid methods. Our experiments show that DCPO significantly improves image quality and relevance to prompts, outperforming Stable Diffusion (SD) 2.1, SFT_Chosen, Diffusion-DPO, and MaPO across multiple metrics, including Pickscore, HPSv2.1, GenEval, CLIPscore, and ImageReward, fine-tuned on SD 2.1 as the backbone.
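
To make the idea of two distinct captions concrete, here is a minimal sketch of a Diffusion-DPO-style preference loss in which the preferred and less preferred branches are conditioned on different caption embeddings, which is the core change the abstract describes. The model call signature, the shared noise and timestep, the 4D latent shape, and the beta value are assumptions for illustration; the paper's exact objective may differ.

```python
import torch
import torch.nn.functional as F

def dual_caption_preference_loss(model, ref_model, noisy_w, noisy_l, noise, t,
                                 cap_w_emb, cap_l_emb, beta=5000.0):
    # noisy_w / noisy_l: preferred and less preferred latents, both noised with `noise` at timestep t.
    # Denoising error of the trainable model, each branch conditioned on its own caption embedding.
    err_w = F.mse_loss(model(noisy_w, t, cap_w_emb), noise, reduction="none").mean(dim=(1, 2, 3))
    err_l = F.mse_loss(model(noisy_l, t, cap_l_emb), noise, reduction="none").mean(dim=(1, 2, 3))

    # Same errors under a frozen reference model (e.g., the backbone before fine-tuning).
    with torch.no_grad():
        ref_err_w = F.mse_loss(ref_model(noisy_w, t, cap_w_emb), noise, reduction="none").mean(dim=(1, 2, 3))
        ref_err_l = F.mse_loss(ref_model(noisy_l, t, cap_l_emb), noise, reduction="none").mean(dim=(1, 2, 3))

    # The preferred branch should improve over the reference by more than the less preferred branch.
    margin = (err_w - ref_err_w) - (err_l - ref_err_l)
    return -F.logsigmoid(-beta * margin).mean()
```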