3DIS-FLUX: simple and efficient multi-instance generation with DiT rendering
Dewei Zhou, Ji Xie, Zongxin Yang, Yi Yang
2025-01-15
Summary
This paper introduces 3DIS-FLUX, a new way to get AI to create images containing multiple objects in user-specified layouts. It builds on an earlier method called 3DIS and uses a powerful AI model called FLUX to render higher-quality images more efficiently.
What's the problem?
Creating AI-generated images with multiple objects in specific positions is tricky. Current methods usually need to be retrained whenever a new, better AI model comes out, which takes a lot of time and computing power. These methods also sometimes fail to render each object's attributes accurately or to place objects exactly where they're supposed to go.
What's the solution?
The researchers improved on a method called 3DIS by pairing it with a newer AI model called FLUX. Their new method, 3DIS-FLUX, works in two steps. First, it builds a depth map of the scene, a rough sketch of where everything should go and how near or far each object is. Then it uses FLUX to fill in the details and make each object look right. They also came up with a clever way, based on attention masking, to make sure each part of the image matches its own text description. Because only the first step needs training, the method doesn't have to be retrained from scratch every time a better rendering model comes out, which saves a lot of time and resources.
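To make the layout step concrete, here is a minimal sketch (not the paper's code; the function name and the simple rounding rule are illustrative assumptions) of how an instance's normalized bounding box can be mapped to the image-patch tokens it covers, which is the kind of bookkeeping the rendering step needs:

```python
import math

def box_to_token_indices(box, grid_w, grid_h):
    """Map a normalized bounding box (x0, y0, x1, y1) to the indices of the
    image-patch tokens it covers on a grid_w x grid_h patch grid.
    Tokens are numbered row-major, matching a flattened patch sequence.
    (Illustrative helper, not from the 3DIS-FLUX codebase.)"""
    x0, y0, x1, y1 = box
    # Convert normalized coordinates to inclusive patch-column/row ranges.
    c0 = int(x0 * grid_w)
    c1 = min(grid_w - 1, math.ceil(x1 * grid_w) - 1)
    r0 = int(y0 * grid_h)
    r1 = min(grid_h - 1, math.ceil(y1 * grid_h) - 1)
    return [r * grid_w + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]

# Example: a box covering the left half of a 4x4 patch grid.
left_half = box_to_token_indices((0.0, 0.0, 0.5, 1.0), grid_w=4, grid_h=4)
```

On a 4×4 grid this selects the two left columns of every row, i.e. 8 of the 16 patch tokens.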
Why does it matter?
This matters because it could make it much easier and faster to create complex, customized images using AI. It could be really useful for things like creating illustrations, designing products, or making movie scenes. By making the process more efficient and improving the quality of the images, it opens up new possibilities for using AI in creative fields. It also shows how combining different AI techniques can lead to big improvements, which could inspire more advances in AI technology.
Abstract
The growing demand for controllable outputs in text-to-image generation has driven significant advancements in multi-instance generation (MIG), enabling users to define both instance layouts and attributes. Currently, the state-of-the-art methods in MIG are primarily adapter-based. However, these methods necessitate retraining a new adapter each time a more advanced model is released, resulting in significant resource consumption. A methodology named Depth-Driven Decoupled Instance Synthesis (3DIS) has been introduced, which decouples MIG into two distinct phases: 1) depth-based scene construction and 2) detail rendering with widely pre-trained depth control models. The 3DIS method requires adapter training solely during the scene construction phase, while enabling various models to perform training-free detail rendering. Initially, 3DIS focused on rendering techniques utilizing U-Net architectures such as SD1.5, SD2, and SDXL, without exploring the potential of recent DiT-based models like FLUX. In this paper, we present 3DIS-FLUX, an extension of the 3DIS framework that integrates the FLUX model for enhanced rendering capabilities. Specifically, we employ the FLUX.1-Depth-dev model for depth map controlled image generation and introduce a detail renderer that manipulates the Attention Mask in FLUX's Joint Attention mechanism based on layout information. This approach allows for the precise rendering of fine-grained attributes of each instance. Our experimental results indicate that 3DIS-FLUX, leveraging the FLUX model, outperforms the original 3DIS method, which utilized SD2 and SDXL, and surpasses current state-of-the-art adapter-based methods in terms of both performance and image quality. Project Page: https://limuloo.github.io/3DIS/.
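The abstract describes a detail renderer that manipulates the Attention Mask in FLUX's Joint Attention based on layout information. The following is a simplified illustration of that idea, not the paper's implementation: over the concatenated [text | image] token sequence used in joint attention, each instance's text tokens are allowed to interact only with the image tokens inside that instance's region (the function name and mask conventions are assumptions for illustration):

```python
import numpy as np

def build_joint_attention_mask(text_spans, instance_image_tokens, n_text, n_img):
    """Boolean attention mask (True = may attend) over a joint sequence of
    n_text text tokens followed by n_img image tokens.

    text_spans:            (start, end) pairs into the text tokens, one per
                           instance (end exclusive).
    instance_image_tokens: lists of image-token indices, one per instance.

    Illustrative sketch of layout-based masking, not the 3DIS-FLUX code.
    """
    n = n_text + n_img
    mask = np.zeros((n, n), dtype=bool)
    # Image tokens always attend to each other, keeping the scene coherent.
    mask[n_text:, n_text:] = True
    for (t0, t1), img_idx in zip(text_spans, instance_image_tokens):
        img = np.asarray(img_idx) + n_text  # offset into the joint sequence
        # This instance's text tokens attend among themselves ...
        mask[t0:t1, t0:t1] = True
        # ... and cross-attend (both directions) with its image tokens only.
        mask[np.ix_(range(t0, t1), img)] = True
        mask[np.ix_(img, range(t0, t1))] = True
    return mask
```

With two instances whose text spans and image regions are disjoint, the mask blocks each instance's description from influencing the other instance's pixels, which is the fine-grained attribute control the abstract refers to.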