
UniF^2ace: Fine-grained Face Understanding and Generation with Unified Multimodal Models

Junzhe Li, Xuerui Qiu, Linrui Xu, Liya Guo, Delin Qu, Tingting Long, Chun Fan, Ming Li

2025-03-12


Summary

This paper introduces UniF^2ace, an AI model that both understands and generates detailed face images. It notices fine features such as dimples or eyebrow shapes, answers questions about them, and can generate new faces that match written descriptions.

What's the problem?

Current AI models for faces either recognize only coarse attributes (like hair color) or generate images; few do both well, and they struggle to handle fine details such as freckles or wrinkles accurately.

What's the solution?

UniF^2ace combines two mutually reinforcing diffusion techniques with a two-level mixture-of-experts architecture, trained on UniF^2ace-130K, a large dataset of face images paired with detailed descriptions and question-answer pairs. This lets it both spot and generate small facial details while answering questions about them.

Why it matters?

This helps in security (like ID checks), medical analysis (spotting skin conditions), and creative work (making realistic characters) by giving AI better tools to handle faces precisely.

Abstract

Unified multimodal models (UMMs) have emerged as a powerful paradigm in foundational computer vision research, demonstrating significant potential in both image understanding and generation. However, existing research in the face domain primarily focuses on coarse facial attribute understanding, with limited capacity to handle fine-grained facial attributes and without addressing generation capabilities. To overcome these limitations, we propose UniF^2ace, the first UMM tailored specifically for fine-grained face understanding and generation. In general, we train UniF^2ace on a self-constructed, specialized dataset utilizing two mutually beneficial diffusion techniques and a two-level mixture-of-experts architecture. Specifically, we first build a large-scale facial dataset, UniF^2ace-130K, which contains 130K image-text pairs with one million question-answering pairs that span a wide range of facial attributes. Second, we establish a theoretical connection between discrete diffusion score matching and masked generative models, optimizing both evidence lower bounds simultaneously, which significantly improves the model's ability to synthesize facial details. Finally, we introduce both token-level and sequence-level mixture-of-experts, enabling efficient fine-grained representation learning for both understanding and generation tasks. Extensive experiments on UniF^2ace-130K demonstrate that UniF^2ace outperforms existing UMMs and generative models, achieving superior performance across both understanding and generation tasks.
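To make the token-level mixture-of-experts idea concrete, here is a minimal sketch of top-1 token routing. The gating scheme, expert count, and dimensions are illustrative assumptions for clarity, not the paper's actual configuration (the paper's sequence-level experts and exact router are not reproduced here).

```python
# Hedged sketch: token-level MoE with top-1 routing, using NumPy linear
# "experts". Dimensions and expert count are illustrative, not the paper's.
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class TokenLevelMoE:
    def __init__(self, d_model=16, n_experts=4):
        # One linear map per expert; a learned router scores each token.
        self.experts = [
            rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
            for _ in range(n_experts)
        ]
        self.router = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)

    def __call__(self, tokens):
        # tokens: (seq_len, d_model). Each token goes to its highest-scoring
        # expert, weighted by the router's softmax probability.
        probs = softmax(tokens @ self.router)   # (seq_len, n_experts)
        choice = probs.argmax(axis=-1)          # top-1 expert index per token
        out = np.empty_like(tokens)
        for i, (tok, e) in enumerate(zip(tokens, choice)):
            out[i] = probs[i, e] * (tok @ self.experts[e])
        return out, choice

moe = TokenLevelMoE()
x = rng.standard_normal((8, 16))
y, routing = moe(x)
print(y.shape, routing)
```

The design intuition is that different tokens (e.g. those covering eyebrows vs. skin texture) can specialize to different experts, so fine-grained attributes get dedicated capacity without running every expert on every token.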