OpenSubject: Leveraging Video-Derived Identity and Diversity Priors for Subject-driven Image Generation and Manipulation

Yexin Liu, Manyuan Zhang, Yueze Wang, Hongyu Li, Dian Zheng, Weiming Zhang, Changsheng Lu, Xunliang Cai, Yan Feng, Peng Pei, Harry Yang

2025-12-09

Summary

This paper introduces OpenSubject, a large dataset built from videos, designed to help image generation models get better at creating and editing images that feature a specific person or object (the "subject").

What's the problem?

Current image generation models, while improving, often fail to faithfully reproduce the specific person or object you provide as a reference, and they especially struggle when the scene is crowded with multiple subjects. They might alter key identity features or place the subject incorrectly in a busy background.

What's the solution?

The researchers built OpenSubject from a huge collection of videos, using a four-stage pipeline to select and prepare training images. First, they filtered the videos to keep only high-quality, high-resolution clips. Second, they mined pairs of frames showing the same person or object, using a vision-language model to confirm the subject's category and favoring pairs that look different from each other. Third, they used image-editing techniques called 'outpainting' and 'inpainting' to synthesize the reference images the models train on, while preserving each subject's identity. Finally, a vision-language model verified the synthesized images, regenerated any failures, and wrote short and long captions. They also created a benchmark to test how well models trained on the data perform.
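The first two stages above (quality filtering, then picking diverse same-subject frame pairs) can be sketched in a few lines. This is a minimal illustrative sketch, not the paper's implementation: the field names, thresholds, and the cosine-similarity pairing rule are all assumptions.

```python
from dataclasses import dataclass
import math

@dataclass
class Frame:
    clip_id: str
    category: str        # subject category from a VLM (assumed)
    embedding: list      # identity embedding for the subject (assumed)

def curate(clips, min_height=720, min_aesthetic=5.0):
    """Stage (i), sketched: keep clips that pass resolution and
    aesthetic filters. Threshold values are made up for illustration."""
    return [c for c in clips
            if c["height"] >= min_height and c["aesthetic"] >= min_aesthetic]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def diversity_pair(frames):
    """Stage (ii), sketched: among frames whose VLM category agrees,
    pick the pair with the least similar identity embeddings, i.e. the
    most diverse poses/contexts for the same subject."""
    best, best_sim = None, 2.0  # cosine similarity is always <= 1
    for i in range(len(frames)):
        for j in range(i + 1, len(frames)):
            a, b = frames[i], frames[j]
            if a.category != b.category:
                continue
            sim = cosine(a.embedding, b.embedding)
            if sim < best_sim:
                best, best_sim = (a, b), sim
    return best
```

For example, given three "dog" frames whose embeddings point in different directions, `diversity_pair` returns the two most dissimilar ones, which is the kind of cross-frame diversity the dataset is built around.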

Why it matters?

This work is important because it provides a much better resource for training image generation models. By using OpenSubject, these models can learn to create and manipulate images with greater accuracy, especially in complex scenes, leading to more realistic and controllable image editing and creation tools.

Abstract

Despite the promising progress in subject-driven image generation, current models often deviate from the reference identities and struggle in complex scenes with multiple subjects. To address this challenge, we introduce OpenSubject, a video-derived large-scale corpus with 2.5M samples and 4.35M images for subject-driven generation and manipulation. The dataset is built with a four-stage pipeline that exploits cross-frame identity priors. (i) Video Curation. We apply resolution and aesthetic filtering to obtain high-quality clips. (ii) Cross-Frame Subject Mining and Pairing. We utilize vision-language model (VLM)-based category consensus, local grounding, and diversity-aware pairing to select image pairs. (iii) Identity-Preserving Reference Image Synthesis. We introduce segmentation map-guided outpainting to synthesize the input images for subject-driven generation and box-guided inpainting to generate input images for subject-driven manipulation, together with geometry-aware augmentations and irregular boundary erosion. (iv) Verification and Captioning. We utilize a VLM to validate synthesized samples, re-synthesize failed samples based on stage (iii), and then construct short and long captions. In addition, we introduce a benchmark covering subject-driven generation and manipulation, and then evaluate identity fidelity, prompt adherence, manipulation consistency, and background consistency with a VLM judge. Extensive experiments show that training with OpenSubject improves generation and manipulation performance, particularly in complex scenes.
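Stage (iii) of the abstract mentions box-guided inpainting with irregular boundary erosion. A minimal sketch of what building and eroding such a mask might look like, on a toy binary grid; the function names, the boundary test, and the random-erosion rule are assumptions for illustration, not the paper's method:

```python
import random

def box_mask(h, w, box):
    """Rectangular binary mask (1 inside the subject box, 0 outside)
    that would drive box-guided inpainting; box = (top, left, bottom, right)."""
    t, l, b, r = box
    return [[1 if t <= y < b and l <= x < r else 0 for x in range(w)]
            for y in range(h)]

def erode_boundary(mask, p=0.5, rng=None):
    """Randomly knock out boundary cells so the mask edge becomes
    irregular, so a model cannot latch onto the clean synthetic
    rectangle outline. Illustrative stand-in for the paper's
    irregular boundary erosion."""
    rng = rng or random.Random(0)
    h, w = len(mask), len(mask[0])

    def on_boundary(y, x):
        if mask[y][x] == 0:
            return False
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if ny < 0 or ny >= h or nx < 0 or nx >= w or mask[ny][nx] == 0:
                return True
        return False

    out = [row[:] for row in mask]
    for y in range(h):
        for x in range(w):
            if on_boundary(y, x) and rng.random() < p:
                out[y][x] = 0
    return out
```

With `p=1.0` the entire one-cell boundary ring is removed while interior cells survive, which shows the intent: keep the subject region but roughen its edge.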