At the core of UNO’s technology are two major innovations: progressive cross-modal alignment and universal rotary position embedding. Progressive cross-modal alignment uses a two-stage training strategy: a base text-to-image model is first fine-tuned on single-subject data, then further trained on generated multi-subject data pairs. This approach allows UNO to excel in scenarios where multiple objects or people must be depicted together without losing their individual identities. The universal rotary position embedding technique addresses attribute confusion, ensuring that the model can distinguish and preserve the features of each subject, even in highly detailed or crowded scenes.
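To illustrate the idea behind position-based subject separation, the sketch below applies standard rotary position embedding (RoPE) and shifts each reference subject's tokens into a disjoint position range before they join the target sequence. This is a simplified 1D illustration under assumptions, not UNO's actual implementation: the token counts, the single-axis positions, and the offsetting scheme here are hypothetical stand-ins for whatever the released model uses.

```python
import numpy as np

def apply_rope(x, positions, base=10000.0):
    """Standard RoPE: rotate channel pairs of x by position-dependent angles."""
    dim = x.shape[-1]
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))
    angles = np.outer(positions, inv_freq)          # (seq_len, dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

# Hypothetical token counts: a target image plus two reference subjects.
tgt_len, ref_len, dim = 16, 8, 64
rng = np.random.default_rng(0)
tgt = rng.standard_normal((tgt_len, dim))
refs = [rng.standard_normal((ref_len, dim)) for _ in range(2)]

# Target tokens occupy positions 0..tgt_len-1; each reference subject's
# tokens are shifted into a disjoint position range, so attention can
# tell subjects apart instead of mixing their attributes.
encoded = [apply_rope(tgt, np.arange(tgt_len))]
offset = tgt_len
for ref in refs:
    encoded.append(apply_rope(ref, np.arange(offset, offset + ref_len)))
    offset += ref_len
sequence = np.concatenate(encoded)  # one sequence for the transformer's attention
```

The key design point is that the rotation angle is a function of the absolute position index, so giving each subject a non-overlapping index range keeps their tokens geometrically distinguishable in attention.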


UNO’s high-consistency data synthesis pipeline is another standout feature, harnessing the inherent in-context generation capabilities of diffusion transformers. This enables the generation of paired data with high consistency, supporting tasks such as virtual try-ons, product displays, and brand-customized content creation. UNO is open-sourced under the Apache 2.0 license for code and CC BY-NC 4.0 for model weights, making it accessible for researchers and developers. Its intuitive design and robust capabilities make it suitable for a wide range of applications, from e-commerce and advertising to creative design and digital storytelling.


Key features include:


  • Single- and multi-subject image generation with high consistency
  • Progressive cross-modal alignment for precise subject control
  • Universal rotary position embedding to prevent attribute confusion
  • High-consistency data synthesis pipeline built on diffusion transformers
  • Multi-image conditional input for complex scene creation
  • Open-source training and inference code
