USO: Unified Style and Subject-Driven Generation via Disentangled and Reward Learning
Shaojin Wu, Mengqi Huang, Yufeng Cheng, Wenxu Wu, Jiahe Tian, Yiming Luo, Fei Ding, Qian He
2025-08-29
Summary
This paper introduces USO (Unified Style-Subject Optimized), a model that aims to improve how image-generation systems apply a specific style while still accurately representing the original subject. It tackles the long-standing challenge of balancing artistic style with content fidelity in image generation.
What's the problem?
Style transfer (changing how an image looks) and subject-driven generation (keeping who or what is in the image consistent) have traditionally been treated as separate problems. Trying to do both at once often leads to compromises: either the style isn't strong enough, or the subject gets distorted. The core issue is that it is hard to separate *what* is in an image (the content) from *how* it looks (the style) and then recombine them effectively.
What's the solution?
The researchers created USO by first building a large-scale dataset of triplets: a content image, a style image, and the same content rendered in that style. They then trained the model with a disentangled learning scheme built from two complementary objectives: style-alignment training, which teaches the model to match the reference style, and content-style disentanglement training, which keeps the subject's identity separate from its appearance. On top of this, a style reward-learning stage (SRL) further refines the model toward outputs that are both stylistically appealing and true to the original subject. Finally, they released USO-Bench, a new benchmark that jointly evaluates style similarity and subject fidelity.
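To make the triplet dataset concrete, here is a minimal sketch of what one record might look like, assuming a simple file-path layout. The class and field names are illustrative, not USO's actual schema:

```python
from dataclasses import dataclass
from pathlib import Path

@dataclass
class StyleTriplet:
    """One hypothetical training example: a subject image, a style
    reference, and the same subject re-rendered in that style."""
    content_image: Path    # the subject whose identity must be preserved
    style_image: Path      # the reference defining the target style
    stylized_image: Path   # content_image rendered in style_image's style

# Example record (paths are placeholders):
example = StyleTriplet(
    content_image=Path("data/content/dog_001.png"),
    style_image=Path("data/style/watercolor_017.png"),
    stylized_image=Path("data/stylized/dog_001_watercolor_017.png"),
)
```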
Why it matters?
This work is important because it moves the field closer to creating image generation tools that can truly customize images in a flexible and controlled way. Instead of choosing between style and substance, USO demonstrates a way to achieve both simultaneously, opening up possibilities for more creative and precise image editing and generation applications.
Abstract
Existing literature typically treats style-driven and subject-driven generation as two disjoint tasks: the former prioritizes stylistic similarity, whereas the latter insists on subject consistency, resulting in an apparent antagonism. We argue that both objectives can be unified under a single framework because they ultimately concern the disentanglement and re-composition of content and style, a long-standing theme in style-driven research. To this end, we present USO, a Unified Style-Subject Optimized customization model. First, we construct a large-scale triplet dataset consisting of content images, style images, and their corresponding stylized content images. Second, we introduce a disentangled learning scheme that simultaneously aligns style features and disentangles content from style through two complementary objectives, style-alignment training and content-style disentanglement training. Third, we incorporate a style reward-learning paradigm denoted as SRL to further enhance the model's performance. Finally, we release USO-Bench, the first benchmark that jointly evaluates style similarity and subject fidelity across multiple metrics. Extensive experiments demonstrate that USO achieves state-of-the-art performance among open-source models along both dimensions of subject consistency and style similarity. Code and model: https://github.com/bytedance/USO
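As a rough illustration of how the two disentangled-learning objectives and the SRL reward term described above could fit together, here is a minimal PyTorch-style sketch. The encoders, cosine-similarity loss forms, and reward model are assumptions made for illustration; the paper does not specify this exact formulation:

```python
import torch
import torch.nn.functional as F

def disentangled_losses(style_enc, content_enc, generated,
                        style_ref, content_ref, reward_model=None):
    """Hypothetical objective combining style alignment,
    content-style disentanglement, and an optional style reward."""
    # Style alignment: pull the generated image's style embedding
    # toward the style reference's embedding (cosine similarity).
    s_gen = F.normalize(style_enc(generated), dim=-1)
    s_ref = F.normalize(style_enc(style_ref), dim=-1)
    style_loss = 1.0 - (s_gen * s_ref).sum(dim=-1).mean()

    # Content preservation: the generated image's content embedding
    # should stay close to the content reference's embedding.
    c_gen = F.normalize(content_enc(generated), dim=-1)
    c_ref = F.normalize(content_enc(content_ref), dim=-1)
    content_loss = 1.0 - (c_gen * c_ref).sum(dim=-1).mean()

    total = style_loss + content_loss

    # Optional style reward-learning (SRL) term: subtract a learned
    # reward so that more stylistically faithful outputs lower the loss.
    if reward_model is not None:
        total = total - reward_model(generated).mean()
    return total

# Toy usage with a dummy encoder on 64x64 RGB images.
if __name__ == "__main__":
    enc = torch.nn.Sequential(torch.nn.Flatten(),
                              torch.nn.Linear(3 * 64 * 64, 128))
    imgs = torch.randn(4, 3, 64, 64)
    print(disentangled_losses(enc, enc, imgs, imgs, imgs).item())
```

In an actual system the style and content encoders would be distinct pretrained networks and the generated image would come from the diffusion backbone; the sketch only shows how the loss terms compose.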