ControlNeXt: Powerful and Efficient Control for Image and Video Generation

Bohao Peng, Jian Wang, Yuechen Zhang, Wenbo Li, Ming-Chang Yang, Jiaya Jia

2024-08-13

Summary

This paper introduces ControlNeXt, a new method that gives AI image and video generators stronger control over their outputs without requiring much extra computing power.

What's the problem?

Current methods for controlling how images and videos are generated, such as ControlNet, often require substantial additional resources, which makes them inefficient, especially for video generation. These methods can also be difficult to train or may offer only weak control over the output.

What's the solution?

The authors propose ControlNeXt, which simplifies the architecture used for controllable generation. Instead of adding a heavy extra branch, they use a streamlined design that cuts the number of trainable parameters by up to 90%. They also introduce a technique called Cross Normalization, which replaces the zero convolutions used in ControlNet and helps the model train faster and more stably (a sketch of the idea appears below). Because the base model is left mostly untouched, ControlNeXt can also be combined with existing LoRA weights to change style without any retraining. The method was tested with several base models for both images and videos and consistently produced high-quality, well-controlled outputs.
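To make that concrete, here is a minimal PyTorch sketch of the idea behind Cross Normalization: the control branch's features are rescaled with statistics taken from the main branch before being added in, so the two distributions match from the start of training. The function name, the axes used for the statistics, and the injection point are illustrative assumptions, not the paper's exact implementation.

```python
import torch

def cross_normalization(main_feat: torch.Tensor,
                        ctrl_feat: torch.Tensor,
                        eps: float = 1e-6) -> torch.Tensor:
    """Rescale control features to match the main branch's statistics.

    This plays the stabilizing role that zero-initialized convolutions
    play in ControlNet: the control signal enters with a distribution
    the base model already expects, so training can converge quickly.
    Assumption: statistics are taken per sample over all non-batch
    dimensions of (B, C, H, W) tensors.
    """
    dims = tuple(range(1, main_feat.dim()))
    m_mean = main_feat.mean(dim=dims, keepdim=True)
    m_std = main_feat.std(dim=dims, keepdim=True)
    c_mean = ctrl_feat.mean(dim=dims, keepdim=True)
    c_std = ctrl_feat.std(dim=dims, keepdim=True)
    normalized = (ctrl_feat - c_mean) / (c_std + eps)
    return normalized * m_std + m_mean

# Hypothetical injection into a denoiser block's feature map:
# fused = main_feat + cross_normalization(main_feat, ctrl_feat)
```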

Why it matters?

This research is important because it makes image and video generation using AI more efficient and accessible. By reducing the computational demands, it allows more people to use advanced AI technologies for creative projects, research, and other applications without needing expensive hardware.

Abstract

Diffusion models have demonstrated remarkable and robust abilities in both image and video generation. To achieve greater control over generated results, researchers introduce additional architectures, such as ControlNet, Adapters and ReferenceNet, to integrate conditioning controls. However, current controllable generation methods often require substantial additional computational resources, especially for video generation, and face challenges in training or exhibit weak control. In this paper, we propose ControlNeXt: a powerful and efficient method for controllable image and video generation. We first design a more straightforward and efficient architecture, replacing heavy additional branches with minimal additional cost compared to the base model. Such a concise structure also allows our method to seamlessly integrate with other LoRA weights, enabling style alteration without the need for additional training. As for training, we reduce up to 90% of learnable parameters compared to the alternatives. Furthermore, we propose another method called Cross Normalization (CN) as a replacement for Zero-Convolution to achieve fast and stable training convergence. We have conducted various experiments with different base models across images and videos, demonstrating the robustness of our method.
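The abstract's point about integrating with other LoRA weights rests on standard LoRA merging: because ControlNeXt leaves the base model's weights and structure intact, a community LoRA can simply be folded into them. Below is a minimal sketch of that general mechanism; it is not code from the paper, and the names and scaling convention are assumptions.

```python
import torch

def merge_lora(base_weight: torch.Tensor,  # shape (out_dim, in_dim)
               lora_down: torch.Tensor,    # shape (rank, in_dim)
               lora_up: torch.Tensor,      # shape (out_dim, rank)
               alpha: float = 1.0) -> torch.Tensor:
    """Fold a LoRA update into a frozen base weight: W' = W + alpha * (up @ down).

    Scaling conventions vary across implementations (some use
    alpha / rank); this sketch keeps a single multiplier for clarity.
    """
    return base_weight + alpha * (lora_up @ lora_down)
```

Since the merge is a one-time weight update, changing style this way costs no additional training, which is the property the abstract highlights.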