Scone: Bridging Composition and Distinction in Subject-Driven Image Generation via Unified Understanding-Generation Modeling

Yuran Wang, Bohan Zeng, Chengzhuo Tong, Wenxuan Liu, Yang Shi, Xiaochen Ma, Hao Liang, Yuanxing Zhang, Wentao Zhang

2025-12-17

Summary

This paper introduces Scone, a method for subject-driven image generation: creating images that faithfully depict specific reference subjects, with a focus on composing multiple subjects in one scene while keeping each one distinct.

What's the problem?

Current subject-driven models have become good at composing multiple subjects into one scene, but they struggle with distinction: when the reference inputs contain several candidate subjects, the model has trouble identifying which one the text is actually referring to, leading to incorrect or mixed-up images in complex scenes. Imagine supplying reference photos of a cat *and* a dog but asking only for the cat; the model may pick the wrong subject or blend features of the two.

What's the solution?

The researchers developed Scone, a unified understanding-generation model trained in two stages. First, it learns how to compose multiple subjects into a scene. Then, it is trained to keep each subject correctly identified and distinct from the others: an understanding expert acts as a 'semantic bridge' that conveys what each subject should be to the generation expert, while semantic alignment and attention-based masking prevent subjects from blending together visually. The authors also built SconeEval, a benchmark that tests how well models handle both composing scenes and distinguishing subjects.
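The paper does not give implementation details here, but the general idea behind attention-based masking can be illustrated with a small sketch: each image-region query is only allowed to attend to the reference tokens of its own subject, so features from different subjects cannot mix. The function name, shapes, and token-to-subject assignment below are illustrative assumptions, not Scone's actual code.

```python
import numpy as np

def masked_attention(queries, keys, values, mask):
    """Scaled dot-product attention with an additive mask.

    mask[i, j] = 0 where query i may attend to key j, -inf elsewhere.
    Restricting each image-region query to its own subject's reference
    tokens is one way to keep subjects from blending into each other.
    Returns the attended values and the attention weights.
    """
    d = queries.shape[-1]
    scores = queries @ keys.T / np.sqrt(d) + mask
    # Numerically stable softmax; masked entries get weight exactly 0.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ values, weights

# Toy setup: 2 image-region queries and 4 reference tokens,
# where tokens 0-1 belong to subject A and tokens 2-3 to subject B.
rng = np.random.default_rng(0)
q = rng.standard_normal((2, 8))
k = rng.standard_normal((4, 8))
v = rng.standard_normal((4, 8))

# Region 0 may only see subject A's tokens; region 1 only subject B's.
mask = np.full((2, 4), -np.inf)
mask[0, :2] = 0.0
mask[1, 2:] = 0.0

out, attn = masked_attention(q, k, v, mask)
```

With this mask, `attn[0]` places zero weight on subject B's tokens and `attn[1]` places zero weight on subject A's, so each region's output is built purely from its own subject's features.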

Why it matters?

This work is important because it improves the realism and accuracy of generated images. Being able to reliably create images with multiple, correctly identified objects is crucial for applications like creating realistic virtual environments, designing products, or even just generating art that matches a specific vision.

Abstract

Subject-driven image generation has advanced from single- to multi-subject composition, while neglecting distinction, the ability to identify and generate the correct subject when inputs contain multiple candidates. This limitation restricts effectiveness in complex, realistic visual settings. We propose Scone, a unified understanding-generation method that integrates composition and distinction. Scone enables the understanding expert to act as a semantic bridge, conveying semantic information and guiding the generation expert to preserve subject identity while minimizing interference. A two-stage training scheme first learns composition, then enhances distinction through semantic alignment and attention-based masking. We also introduce SconeEval, a benchmark for evaluating both composition and distinction across diverse scenarios. Experiments demonstrate that Scone outperforms existing open-source models in composition and distinction tasks on two benchmarks. Our model, benchmark, and training data are available at: https://github.com/Ryann-Ran/Scone.