At the core of UNO’s technology are two major innovations: progressive cross-modal alignment and universal rotary position embedding. Progressive cross-modal alignment is a two-stage training strategy: a base text-to-image model is first fine-tuned on single-subject data, then further trained on generated multi-subject data pairs. This lets UNO excel in scenarios where multiple objects or people must be depicted together without losing their individual identities. The universal rotary position embedding technique addresses attribute confusion, where features of one subject bleed into another, ensuring that the model can distinguish and preserve the features of each subject even in highly detailed or crowded scenes.
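One way to picture how a position-embedding scheme can keep subjects apart: in a diffusion transformer with 2D rotary embeddings, tokens from each reference image can be assigned position indices offset diagonally past those of the generated latent, so no two subjects (and neither subject and the target image) share a position range. The sketch below illustrates only that indexing idea; the function name and grid sizes are illustrative, not UNO's actual implementation:

```python
def build_position_ids(gen_hw, ref_hws):
    """Illustrative diagonal position offsets for reference-image tokens.

    gen_hw:  (H, W) of the generated latent grid.
    ref_hws: list of (h, w) grids, one per reference image.
    Returns one list of (row, col) ids per grid; each reference grid
    starts past the previous grid's maximum row and column, so the
    position ranges of the target image and each subject never collide.
    """
    H, W = gen_hw
    ids = [[(r, c) for r in range(H) for c in range(W)]]  # generated image
    off_r, off_c = H, W  # diagonal offset starts just past the generated grid
    for h, w in ref_hws:
        ids.append([(off_r + r, off_c + c) for r in range(h) for c in range(w)])
        off_r += h  # shift the next reference further down
        off_c += w  # and further right, keeping offsets diagonal
    return ids
```

Because every grid occupies a disjoint block of the 2D position space, attention can still relate subjects to the target image while the rotary phases keep their identities distinguishable.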
UNO’s high-consistency data synthesis pipeline is another standout feature, harnessing the inherent in-context generation capabilities of diffusion transformers. This enables the generation of paired data with high consistency, supporting tasks such as virtual try-ons, product displays, and brand-customized content creation. UNO is open-sourced under the Apache 2.0 license for code and CC BY-NC 4.0 for model weights, making it accessible for researchers and developers. Its intuitive design and robust capabilities make it suitable for a wide range of applications, from e-commerce and advertising to creative design and digital storytelling.
Key features include:
- Supports both single and multi-subject image generation with high consistency
- Progressive cross-modal alignment for precise subject control
- Universal rotary position embedding to prevent attribute confusion
- High-consistency data synthesis pipeline using diffusion transformers
- Enables multi-image conditional input for complex scene creation
- Open-source with accessible training and inference code
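The two-stage progressive alignment described above can be sketched as a simple training schedule. Everything here is an illustrative placeholder, not UNO's training code: `train_step`, the dataset arguments, and the step counts are assumptions used only to show the single-subject-then-multi-subject ordering:

```python
def progressive_alignment(model, single_subject_data, multi_subject_data,
                          train_step, stage1_steps, stage2_steps):
    """Illustrative two-stage schedule: single-subject first, multi-subject second."""
    # Stage 1: fine-tune the base text-to-image model on single-subject pairs.
    for step in range(stage1_steps):
        batch = single_subject_data[step % len(single_subject_data)]
        train_step(model, batch)
    # Stage 2: continue training on generated multi-subject data pairs,
    # extending subject control from one subject to several.
    for step in range(stage2_steps):
        batch = multi_subject_data[step % len(multi_subject_data)]
        train_step(model, batch)
    return model
```

The ordering is the point: the model first learns to preserve a single subject's identity before being asked to juggle several at once.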