ACE: All-round Creator and Editor Following Instructions via Diffusion Transformer
Zhen Han, Zeyinzi Jiang, Yulin Pan, Jingfeng Zhang, Chaojie Mao, Chenwei Xie, Yu Liu, Jingren Zhou
2024-10-02

Summary
This paper introduces ACE, a diffusion-transformer model that both creates and edits images by following instructions, handling a wide range of visual generation tasks within a single model.
What's the problem?
Most existing text-to-image models are limited because they accept only one type of input at a time. This makes it hard to perform complex visual editing tasks that require combining multiple kinds of information, such as reference images and textual instructions. As a result, no single model can cover all visual generation needs in the way GPT-4 serves as a unified model for text.
What's the solution?
ACE addresses this issue by introducing a unified input format called the Long-context Condition Unit (LCU), which lets the model process multiple types of inputs together, and a Transformer-based diffusion model that takes LCUs as input so it can be jointly trained on many generation and editing tasks at once. To gather the necessary training data, the authors build paired images with synthesis-based and clustering-based pipelines and use a fine-tuned multi-modal large language model to supply accurate textual instructions for each pair. Evaluated against other leading models on a manually annotated benchmark, ACE shows superior performance in generating and editing images.
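To make the LCU idea concrete, here is a minimal sketch of what such a unified condition unit might look like as a data structure. This is an illustrative assumption, not the paper's actual interface: the class and field names are hypothetical, and the real LCU interleaves conditions at the model-input level rather than as a plain Python container.

```python
from dataclasses import dataclass, field
from typing import Any, List, Optional

@dataclass
class ConditionFrame:
    instruction: str             # textual instruction for this step
    image: Optional[Any] = None  # optional reference or source image for this step

@dataclass
class LongContextConditionUnit:
    """Hypothetical sketch: bundles an arbitrary-length sequence of instructions
    and images so one diffusion transformer can consume many task types uniformly."""
    frames: List[ConditionFrame] = field(default_factory=list)

    def add(self, instruction: str, image: Optional[Any] = None) -> None:
        self.frames.append(ConditionFrame(instruction=instruction, image=image))

# Example: a two-turn editing request expressed as one conditioning unit,
# instead of routing each step to a different task-specific model.
lcu = LongContextConditionUnit()
lcu.add("Generate a watercolor painting of a lighthouse at dusk.")
lcu.add("Remove the birds from the sky in the previous result.")
```

The point of the format is that text-to-image generation, editing, and multi-turn refinement all reduce to the same kind of input, which is what enables joint training across tasks.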
Why it matters?
This research is important because it creates a versatile tool for generating and editing images, making it easier for users to create custom visuals based on their specific needs. By combining different types of inputs into one model, ACE simplifies the process of visual creation and opens up new possibilities for applications in art, design, and interactive media.
Abstract
Diffusion models have emerged as a powerful generative technology and are applicable to a wide range of scenarios. However, most existing foundational diffusion models are designed primarily for text-guided visual generation and do not support multi-modal conditions, which are essential for many visual editing tasks. This limitation prevents them from serving as a unified model for visual generation, as GPT-4 does in natural language processing. In this work, we propose ACE, an All-round Creator and Editor, which achieves performance comparable to that of expert models across a wide range of visual generation tasks. To achieve this goal, we first introduce a unified condition format termed the Long-context Condition Unit (LCU) and propose a novel Transformer-based diffusion model that uses LCU as input, enabling joint training across various generation and editing tasks. Furthermore, we propose an efficient data collection approach to address the lack of available training data. It involves acquiring paired images with synthesis-based or clustering-based pipelines and supplying these pairs with accurate textual instructions by leveraging a fine-tuned multi-modal large language model. To comprehensively evaluate the performance of our model, we establish a benchmark of manually annotated paired data across a variety of visual generation tasks. The extensive experimental results demonstrate the superiority of our model in visual generation tasks. Thanks to the all-in-one capabilities of our model, we can easily build a multi-modal chat system that responds to any interactive request for image creation, using a single model as the backend and avoiding the cumbersome pipelines typically employed in visual agents. Code and models will be available on the project page: https://ali-vilab.github.io/ace-page/.
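As a rough illustration of the "single model as chat backend" idea mentioned in the abstract, the sketch below shows how every interactive request could be routed to one model without task-specific pipelines. The `ChatSession` class and the `generate(context=...)` call are assumptions made for illustration; the released ACE code may expose a different API.

```python
from typing import Any, Dict, List, Optional

class ChatSession:
    """Hypothetical sketch of a multi-modal chat backend served by one all-in-one model."""

    def __init__(self, ace_model: Any):
        self.model = ace_model
        self.history: List[Dict[str, Any]] = []  # interleaved instructions and images from earlier turns

    def request(self, instruction: str, images: Optional[List[Any]] = None) -> Any:
        # Every task (generation, editing, inpainting, ...) goes through the same model;
        # no per-task routing or chain of expert models is needed.
        turn = {"instruction": instruction, "images": images or []}
        output = self.model.generate(context=self.history + [turn])  # assumed call signature
        self.history.append(turn)
        self.history.append({"instruction": "", "images": [output]})
        return output
```

The contrast is with typical visual agents, which dispatch each request type to a different specialist model and must stitch their outputs together.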