MMFace-DiT: A Dual-Stream Diffusion Transformer for High-Fidelity Multimodal Face Generation
Bharath Krishnamurthy, Ajita Rattani
2026-04-01
Summary
This paper introduces a new method, MMFace-DiT, for creating realistic and controllable facial images from both text descriptions and spatial guides like sketches or masks.
What's the problem?
Current methods for generating faces from text *and* spatial information often feel like add-ons to existing image generators. They either bolt extra control modules onto a pre-trained model or stitch together separate systems, which can let one type of input (text or sketch) overpower the other, or leave the pieces poorly coordinated. This makes it hard to produce images that match both the meaning of the text and the structure of the spatial guide.
What's the solution?
The researchers developed MMFace-DiT, a system built from the ground up to handle text and spatial information equally well. It uses a special 'dual-stream transformer' that processes the text and spatial data in parallel, allowing them to influence each other deeply through a shared attention mechanism. This ensures that both the meaning of the text and the structure of the sketch are respected. They also created a 'Modality Embedder' that lets the model adapt to different types of spatial input without needing to be retrained for each one.
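To make the idea concrete, below is a minimal PyTorch sketch of a dual-stream block in this spirit. It is an illustration, not the authors' implementation: the layer sizes, the use of `nn.MultiheadAttention`, and the `ModalityEmbedder` lookup table are all assumptions, and the real model additionally applies RoPE inside the shared attention (see the abstract below).

```python
import torch
import torch.nn as nn

class DualStreamBlock(nn.Module):
    """Illustrative dual-stream block: each modality keeps its own norms and
    MLP, but all tokens attend jointly in one shared attention, so neither
    stream conditions the other through a one-way bottleneck."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.norm1_sem, self.norm1_spa = nn.LayerNorm(dim), nn.LayerNorm(dim)
        self.norm2_sem, self.norm2_spa = nn.LayerNorm(dim), nn.LayerNorm(dim)
        # One attention module shared by both token streams.
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.mlp_sem = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
        self.mlp_spa = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, sem: torch.Tensor, spa: torch.Tensor):
        # sem: (B, N_text, dim) text tokens; spa: (B, N_spatial, dim) mask/sketch tokens
        n = sem.shape[1]
        x = torch.cat([self.norm1_sem(sem), self.norm1_spa(spa)], dim=1)
        fused, _ = self.attn(x, x, x)   # joint attention over the concatenated sequence
        sem = sem + fused[:, :n]        # split the fused tokens back per stream
        spa = spa + fused[:, n:]
        sem = sem + self.mlp_sem(self.norm2_sem(sem))
        spa = spa + self.mlp_spa(self.norm2_spa(spa))
        return sem, spa

class ModalityEmbedder(nn.Module):
    """Hypothetical modality embedder: one learned vector per spatial-condition
    type (e.g. 0 = mask, 1 = sketch, 2 = edge map), added to the spatial tokens
    so a single model can switch conditions without retraining."""

    def __init__(self, dim: int, num_modalities: int = 3):
        super().__init__()
        self.table = nn.Embedding(num_modalities, dim)

    def forward(self, spa: torch.Tensor, modality_id: torch.Tensor):
        # modality_id: (B,) integer id of the active spatial condition
        return spa + self.table(modality_id).unsqueeze(1)
```

Stacking such blocks, then decoding the spatial stream back to image latents, is roughly what a full diffusion transformer would add on top of this fusion step.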
Why it matters?
MMFace-DiT significantly improves the quality and accuracy of generated faces, showing a 40% improvement in visual fidelity and prompt alignment over existing methods. This represents a step forward in building flexible and powerful tools for generating images that are both semantically meaningful and structurally precise, opening the door to more controlled and creative image editing and generation.
Abstract
Recent multimodal face generation models address the spatial-control limitations of text-to-image diffusion models by augmenting text-based conditioning with spatial priors such as segmentation masks, sketches, or edge maps. This multimodal fusion enables controllable synthesis aligned with both high-level semantic intent and low-level structural layout. However, most existing approaches extend pre-trained text-to-image pipelines by appending auxiliary control modules or stitching together separate uni-modal networks. These ad hoc designs inherit architectural constraints, duplicate parameters, and often fail under conflicting modalities or mismatched latent spaces, limiting their ability to perform synergistic fusion across semantic and spatial domains. We introduce MMFace-DiT, a unified dual-stream diffusion transformer engineered for synergistic multimodal face synthesis. Its core novelty lies in a dual-stream transformer block that processes spatial (mask/sketch) and semantic (text) tokens in parallel, deeply fusing them through a shared attention mechanism with Rotary Position Embeddings (RoPE). This design prevents modal dominance and ensures strong adherence to both textual and structural priors, achieving unprecedented spatial-semantic consistency for controllable face generation. Furthermore, a novel Modality Embedder enables a single cohesive model to adapt dynamically to varying spatial conditions without retraining. MMFace-DiT achieves a 40% improvement in visual fidelity and prompt alignment over six state-of-the-art multimodal face generation models, establishing a flexible new paradigm for end-to-end controllable generative modeling. The code and dataset are available on our project page: https://vcbsl.github.io/MMFace-DiT/
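For readers curious about the shared RoPE attention mentioned in the abstract, here is a minimal, generic sketch of rotary position embeddings applied to query/key tensors before a joint attention over the concatenated text-plus-spatial sequence. The split-halves rotation and the `base` value follow standard RoPE conventions and are not details taken from the paper.

```python
import torch

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Rotary position embedding for query/key tensors of shape
    (B, heads, N, head_dim). Applying one shared rotation across the
    concatenated text+spatial sequence gives both streams a single,
    consistent positional frame inside the shared attention."""
    b, h, n, d = x.shape
    half = d // 2
    # Per-dimension rotation frequencies and per-position angles.
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = torch.arange(n, dtype=x.dtype, device=x.device)[:, None] * freqs[None, :]  # (N, half)
    cos, sin = angles.cos(), angles.sin()
    # Rotate each (x1, x2) pair of channels by its position-dependent angle.
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

# Usage inside attention: rotate queries and keys, then do standard
# scaled dot-product attention over the fused token sequence.
# q, k = apply_rope(q), apply_rope(k)
```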