ComfyGen: Prompt-Adaptive Workflows for Text-to-Image Generation
Rinon Gal, Adi Haviv, Yuval Alaluf, Amit H. Bermano, Daniel Cohen-Or, Gal Chechik
2024-10-03

Summary
This paper introduces ComfyGen, a new system that improves text-to-image generation by automatically creating workflows tailored to specific user prompts.
What's the problem?
Generating images from text prompts is challenging because most models apply a single, fixed pipeline to every prompt. In practice, high-quality results often come from workflows that combine multiple specialized components, but a one-size-fits-all setup cannot capture the unique requirements of each prompt, so image quality suffers.
What's the solution?
ComfyGen addresses this problem with two LLM-based methods for creating prompt-adaptive workflows. The first, ComfyGen-IC, is training-free: the LLM selects an existing workflow for the current prompt based on how those workflows have performed on similar prompts in the past. The second, ComfyGen-FT, fine-tunes an LLM on user-preference data covering both successful and unsuccessful workflows, so the model learns which workflow choices produce high-quality images for a given prompt. Both methods produced higher-quality images than monolithic models or fixed, prompt-independent workflows.
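
To make the selection idea concrete, below is a minimal Python sketch of the training-free, ComfyGen-IC-style approach: an LLM is shown a table of candidate workflows with their historical quality scores and asked to pick one for a new prompt. Everything here (the score table, the category names, and the call_llm interface) is an illustrative assumption, not the paper's actual implementation.

    # Hypothetical historical data: workflow id -> average quality score per
    # prompt category, gathered from past generations. Values are invented.
    score_table = {
        "flow_photoreal_upscale": {"people": 0.81, "landscapes": 0.74, "text": 0.52},
        "flow_anime_lora":        {"people": 0.63, "landscapes": 0.58, "text": 0.40},
        "flow_typography":        {"people": 0.45, "landscapes": 0.50, "text": 0.77},
    }

    def build_selection_prompt(user_prompt: str) -> str:
        """Format the score table and user prompt for in-context selection."""
        rows = "\n".join(
            f"- {flow}: " + ", ".join(f"{cat}={s:.2f}" for cat, s in cats.items())
            for flow, cats in score_table.items()
        )
        return (
            "You choose image-generation workflows. Historical quality scores:\n"
            f"{rows}\n\n"
            f"User prompt: {user_prompt!r}\n"
            "Reply with the single workflow id best suited to this prompt."
        )

    def select_workflow(user_prompt: str, call_llm) -> str:
        """call_llm is any text-in/text-out LLM interface (an assumed stand-in)."""
        reply = call_llm(build_selection_prompt(user_prompt))
        return reply.strip()

    # Usage (with some LLM client wrapped as my_llm):
    #   chosen = select_workflow("a portrait of an astronaut", call_llm=my_llm)

The fine-tuned variant would instead train the LLM to output a full workflow given a prompt and a target quality score, learning from both high- and low-scoring examples rather than selecting from a fixed list.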
Why it matters?
This research is important because it makes text-to-image generation more efficient and accessible to users who lack the technical expertise to craft complex workflows themselves. By automating this process, ComfyGen enables more people to generate high-quality images tailored to their specific needs, which can be beneficial in fields like marketing, education, and content creation.
Abstract
The practical use of text-to-image generation has evolved from simple, monolithic models to complex workflows that combine multiple specialized components. While workflow-based approaches can lead to improved image quality, crafting effective workflows requires significant expertise, owing to the large number of available components, their complex inter-dependence, and their dependence on the generation prompt. Here, we introduce the novel task of prompt-adaptive workflow generation, where the goal is to automatically tailor a workflow to each user prompt. We propose two LLM-based approaches to tackle this task: a tuning-based method that learns from user-preference data, and a training-free method that uses the LLM to select existing flows. Both approaches lead to improved image quality when compared to monolithic models or generic, prompt-independent workflows. Our work shows that prompt-dependent flow prediction offers a new pathway to improving text-to-image generation quality, complementing existing research directions in the field.