EvolveDirector: Approaching Advanced Text-to-Image Generation with Large Vision-Language Models

Rui Zhao, Hangjie Yuan, Yujie Wei, Shiwei Zhang, Yuchao Gu, Lingmin Ran, Xiang Wang, Zhangjie Wu, Junhao Zhang, Yingya Zhang, Mike Zheng Shou

2024-10-14

Summary

This paper introduces EvolveDirector, a framework for training a high-quality text-to-image generation model using only publicly available resources, by querying advanced closed models through their public APIs.

What's the problem?

Many advanced text-to-image generation models are trained on proprietary high-quality data and withhold their parameters, exposing only paid APIs. This makes it difficult for others to reproduce their results or build on them without access to expensive data and compute.

What's the solution?

EvolveDirector interacts with advanced models via their public APIs to collect text-image pairs for training a base model. Done naively, this requires a very large generated dataset (over 10 million samples) to approach the advanced model's quality. To make the process more efficient, EvolveDirector uses pre-trained vision-language models (VLMs) to continuously evaluate the base model during training and dynamically refine the dataset through discrimination, expansion, deletion, and mutation operations, which greatly reduces the amount of data needed. When multiple advanced models are available, it selects the best samples each one generates, so the trained model learns balanced strengths; a sketch of this evolution loop follows below.
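
To make the dataset-evolution idea concrete, here is a minimal Python sketch of one round of the loop, under stated assumptions: `base_model.generate`, `advanced_api.generate`, and a `vlm` object with `prefers`, `expand_prompt`, and `mutate_prompt` methods are hypothetical stand-ins for illustration, not the interfaces of the released code.

```python
import random

def evolve_dataset(dataset, base_model, vlm, advanced_api, prompt_pool):
    """One evolution round: the VLM judges the base model against the
    advanced model's outputs, then updates the training set accordingly."""
    refined = []
    for prompt, api_image in dataset:
        base_image = base_model.generate(prompt)
        # Discrimination: keep the pair only while the advanced model still
        # wins; pairs the base model has mastered are deleted by not being
        # re-added to the refined set.
        if vlm.prefers(api_image, base_image, prompt):
            refined.append((prompt, api_image))
            # Expansion: add a variation of a prompt the base model
            # still handles poorly, paired with a fresh API sample.
            variant = vlm.expand_prompt(prompt)
            refined.append((variant, advanced_api.generate(variant)))
    # Mutation: refresh a fraction of the pool with novel prompts
    # to keep the training data diverse.
    for _ in range(max(1, len(refined) // 10)):
        novel = vlm.mutate_prompt(random.choice(prompt_pool))
        refined.append((novel, advanced_api.generate(novel)))
    return refined
```

Because mastered pairs drop out each round, API calls concentrate on prompts where the base model still lags, which is how the framework cuts the data volume well below the naive 10-million-sample requirement.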

Why it matters?

This research is important because it democratizes access to powerful image generation technology by using publicly available resources. By making it easier and cheaper to train high-quality text-to-image models, EvolveDirector can help more people and organizations create innovative applications in art, design, and other fields that rely on visual content.

Abstract

Recent advancements in generation models have showcased remarkable capabilities in generating fantastic content. However, most of them are trained on proprietary high-quality data, and some models withhold their parameters and only provide accessible application programming interfaces (APIs), limiting their benefits for downstream tasks. To explore the feasibility of training a text-to-image generation model comparable to advanced models using publicly available resources, we introduce EvolveDirector. This framework interacts with advanced models through their public APIs to obtain text-image data pairs to train a base model. Our experiments with extensive data indicate that a model trained on the generated data of an advanced model can approximate its generation capability. However, it requires large-scale samples of 10 million or more, which incurs significant expenses in time, computational resources, and especially the costs of calling fee-based APIs. To address this problem, we leverage pre-trained large vision-language models (VLMs) to guide the evolution of the base model. The VLM continuously evaluates the base model during training and dynamically updates and refines the training dataset through discrimination, expansion, deletion, and mutation operations. Experimental results show that this paradigm significantly reduces the required data volume. Furthermore, when approaching multiple advanced models, EvolveDirector can select the best samples they generate to learn powerful and balanced abilities. The final trained model, Edgen, is demonstrated to outperform these advanced models. The code and model weights are available at https://github.com/showlab/EvolveDirector.
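
The abstract's last point, selecting the best sample when several advanced models are available, can be pictured with another short hypothetical sketch; the `apis` collection and the `vlm.score` method are assumed interfaces for illustration, not EvolveDirector's actual code.

```python
def pick_best_sample(prompt, apis, vlm):
    """Generate one candidate image per advanced model and keep the one
    the VLM scores highest for this prompt; the winning pair joins the
    training set, so the base model inherits each model's strengths."""
    candidates = [(api.name, api.generate(prompt)) for api in apis]
    best_name, best_image = max(
        candidates, key=lambda pair: vlm.score(pair[1], prompt)
    )
    return best_name, best_image
```

Scoring per prompt rather than per model is what yields the "balanced abilities" the abstract describes: each advanced model contributes only the prompts it handles best.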