The DiffSensei framework consists of two stages. In the first stage, a multi-character, customized manga image generation model with layout control is trained: the dialog embedding is added to the noised latent after the first convolution layer, and all parameters of the U-Net and the feature extractor are updated. In the second stage, the LoRA and resampler weights of a multimodal large language model (MLLM) are fine-tuned to adapt the source character features to the text prompt; the stage-one model serves as the image generator, and its weights are kept frozen.
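The following is a minimal PyTorch sketch of this two-stage parameter setup, assuming the paper's architecture as described above. The identifiers (`conv_in`, `forward_body`, `lora_`, `resampler`) are illustrative assumptions, not the authors' actual module names.

```python
import torch
import torch.nn as nn


class DialogConditionedUNet(nn.Module):
    """Stage 1: inject a dialog embedding into the noised latent
    right after the U-Net's first convolution layer."""

    def __init__(self, unet: nn.Module, dialog_dim: int, latent_channels: int):
        super().__init__()
        self.unet = unet
        # Project the dialog embedding to the latent feature width so it
        # can be added element-wise to the post-conv feature map.
        self.dialog_proj = nn.Linear(dialog_dim, latent_channels)

    def forward(self, noised_latent, timestep, char_features, dialog_emb):
        # Assumes the wrapped U-Net exposes `conv_in` (first conv) and a
        # `forward_body` over the remaining blocks -- hypothetical names.
        h = self.unet.conv_in(noised_latent)
        h = h + self.dialog_proj(dialog_emb)[..., None, None]  # add dialog embedding
        return self.unet.forward_body(h, timestep, char_features)


def stage2_trainable_params(generator: nn.Module, mllm: nn.Module):
    """Stage 2: freeze the stage-1 image generator and train only the
    MLLM's LoRA and resampler weights."""
    for p in generator.parameters():
        p.requires_grad_(False)  # generator stays frozen in stage 2
    trainable = [
        p for name, p in mllm.named_parameters()
        if "lora_" in name or "resampler" in name
    ]
    for p in trainable:
        p.requires_grad_(True)
    return trainable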
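In stage two, only the tensors returned by `stage2_trainable_params` would be handed to the optimizer, so the image generator receives no gradient updates while the MLLM learns to adapt character features to the prompt.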
DiffSensei is accompanied by a large-scale dataset called MangaZero, which contains 43,264 manga pages and 427,147 annotated panels, supporting the depiction of varied character interactions and movements across sequential frames. Extensive experiments demonstrate that DiffSensei outperforms existing models, marking a significant advance in manga generation through text-adaptable character customization. The code, model, and dataset will be open-sourced to the community to enable further development and research in this area.