Key Features

Dynamic multi-character control
Diffusion-based image generator
Multimodal large language model (MLLM)
Text-compatible identity adapter
Masked cross-attention for character feature incorporation (sketched after this list)
Layout control without direct pixel transfer
Flexible adjustments in character expressions, poses, and actions
Large-scale dataset (MangaZero) for training and evaluation
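
Below is a minimal single-head sketch of how masked cross-attention could restrict each latent region to its assigned character's features; the shapes, names, and the omission of multi-head projections are assumptions for illustration, not DiffSensei's actual code.

```python
import torch
import torch.nn.functional as F


def masked_cross_attention(latent_tokens, char_tokens, char_masks):
    """
    latent_tokens: (B, N, D)    flattened image-latent tokens (queries)
    char_tokens:   (B, C, T, D) feature tokens for C characters (keys/values)
    char_masks:    (B, C, N)    1 where latent position n lies inside
                                character c's layout box, else 0
    """
    B, N, D = latent_tokens.shape
    _, C, T, _ = char_tokens.shape
    k = char_tokens.reshape(B, C * T, D)
    v = k
    # expand each character's layout mask to all T of its tokens:
    # latent position n may attend to character c only inside c's box
    attn_mask = (char_masks.unsqueeze(-1)     # (B, C, N, 1)
                 .expand(B, C, N, T)
                 .permute(0, 2, 1, 3)         # (B, N, C, T)
                 .reshape(B, N, C * T)
                 .bool())
    scores = latent_tokens @ k.transpose(-1, -2) / D ** 0.5
    scores = scores.masked_fill(~attn_mask, torch.finfo(scores.dtype).min)
    # positions outside every character's box receive zero character features
    has_ctx = attn_mask.any(dim=-1, keepdim=True)
    return (F.softmax(scores, dim=-1) * has_ctx) @ v
```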

The DiffSensei framework consists of two stages. In the first stage, a multi-character customized manga image generation model with layout control is trained: the dialog embedding is added to the noised latent after the first convolution layer, and all parameters of the U-Net and feature extractor are trained. In the second stage, the LoRA and resampler weights of an MLLM are fine-tuned to adapt the source character features to the text prompt, while the stage-one model serves as the image generator with its weights frozen. Both stages are sketched below.
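
A minimal sketch of the stage-one injection point described above, assuming a Stable-Diffusion-style U-Net stem; the module and parameter names (`DialogInjectedStem`, `dialog_dim`) are hypothetical, not the paper's identifiers.

```python
import torch
import torch.nn as nn


class DialogInjectedStem(nn.Module):
    """U-Net input stem that adds a dialog embedding to the noised latent
    right after the first convolution, as described for stage one."""

    def __init__(self, latent_channels=4, model_channels=320, dialog_dim=768):
        super().__init__()
        self.conv_in = nn.Conv2d(latent_channels, model_channels, 3, padding=1)
        # project the dialog embedding to the conv output's channel count
        self.dialog_proj = nn.Linear(dialog_dim, model_channels)

    def forward(self, noised_latent, dialog_emb):
        # noised_latent: (B, 4, H, W); dialog_emb: (B, dialog_dim)
        h = self.conv_in(noised_latent)
        # broadcast the projected embedding over all spatial positions
        return h + self.dialog_proj(dialog_emb)[:, :, None, None]
```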
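
And a sketch of the stage-two parameter setup; `mllm`, `resampler`, and `image_generator` are placeholder names, and the `"lora"` substring check stands in for however the LoRA modules are actually tagged.

```python
def configure_stage2(mllm, resampler, image_generator):
    # the stage-one model acts as the image generator and stays frozen
    for p in image_generator.parameters():
        p.requires_grad = False
    # only the MLLM's LoRA weights are fine-tuned
    for name, p in mllm.named_parameters():
        p.requires_grad = "lora" in name
    # the resampler adapting character features is also trained
    for p in resampler.parameters():
        p.requires_grad = True
```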

DiffSensei is accompanied by MangaZero, a large-scale dataset of 43,264 manga pages and 427,147 annotated panels that supports depicting varied character interactions and movements across sequential panels. Extensive experiments demonstrate that DiffSensei outperforms existing models, marking a significant advance in manga generation through text-adaptable character customization. The code, model, and dataset will be open-sourced to the community for further development and research.
