WithAnyone: Towards Controllable and ID Consistent Image Generation
Hengyuan Xu, Wei Cheng, Peng Xing, Yixiao Fang, Shuhan Wu, Rui Wang, Xianfang Zeng, Daxin Jiang, Gang Yu, Xingjun Ma, Yu-Gang Jiang
2025-10-17
Summary
This paper focuses on creating realistic images of people from text descriptions, specifically making sure the generated image looks like a particular person you specify. It introduces WithAnyone, a diffusion-based model that preserves a specified identity more faithfully and controllably than previous methods.
What's the problem?
Current methods for generating images of specific people often fall short because they rely on limited data. Because there aren't enough examples of the same person in different situations, the models tend to just copy and paste the face from the reference image instead of actually understanding and recreating the person's identity with natural variations like different poses or expressions. This results in images that look unnatural and aren't very flexible.
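One simple way to see how such over-similarity could be quantified (this is an illustrative sketch, not necessarily the paper's benchmark metric) is to compare how much closer the generated face is to the single reference photo than to other photos of the same person, using face embeddings. A genuine re-rendering of the identity should be roughly equidistant from all photos of that person, while a pixel-level copy sits suspiciously close to the exact reference:

```python
import numpy as np

def cosine(a, b):
    """Cosine similarity between two embedding vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def copy_paste_score(gen_emb, ref_emb, other_embs):
    """Hypothetical copy-paste indicator (illustrative, not the paper's metric).

    gen_emb:    face embedding of the generated image.
    ref_emb:    embedding of the single reference photo used for generation.
    other_embs: embeddings of OTHER photos of the same identity.

    A score near 0 suggests the model re-rendered the identity;
    a large positive score suggests it copied the reference face.
    """
    sim_ref = cosine(gen_emb, ref_emb)
    sim_others = np.mean([cosine(gen_emb, e) for e in other_embs])
    return sim_ref - sim_others
```

In practice the embeddings would come from a pretrained face-recognition model; the specific function and variable names here are made up for illustration.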
What's the solution?
The researchers tackled this problem in three main ways. First, they created a large new dataset with lots of images of many different people, showing them in various poses and expressions. Second, they developed a way to measure how much a generated image is just a copy-paste versus a genuine variation of the person's identity. Finally, they designed a new training method for the image generator that encourages it to create diverse images while still accurately representing the person's face, using a technique called 'contrastive identity loss'. Together, these contributions produced WithAnyone, a model built on diffusion techniques.
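The contrastive identity loss can be sketched in an InfoNCE-like form (a minimal sketch under assumed details, since the paper's exact formulation is not given here): the generated face embedding is pulled toward embeddings of several different photos of the same identity and pushed away from other identities. Pulling toward multiple positives, rather than only the single reference image, is what discourages pixel-level copy-paste:

```python
import numpy as np

def contrastive_identity_loss(gen_emb, pos_embs, neg_embs, temperature=0.07):
    """InfoNCE-style contrastive identity loss (illustrative sketch).

    gen_emb:  (D,)   embedding of the face in the generated image.
    pos_embs: (P, D) embeddings of other photos of the SAME identity.
    neg_embs: (N, D) embeddings of DIFFERENT identities.
    """
    def normalize(x):
        return x / np.linalg.norm(x, axis=-1, keepdims=True)

    g = normalize(gen_emb)
    pos = normalize(pos_embs) @ g / temperature   # (P,) similarities to positives
    neg = normalize(neg_embs) @ g / temperature   # (N,) similarities to negatives
    logits = np.concatenate([pos, neg])
    log_denom = np.log(np.exp(logits).sum())
    # Average over positives: each positive should outrank all negatives.
    return float(np.mean(log_denom - pos))
```

The loss is small when the generated embedding is close to the positive set and large when it drifts toward another identity; in a real training loop this term would be combined with the usual diffusion objective and computed with a pretrained face encoder.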
Why does it matter?
This work is important because it improves the quality and control we have over generating images of people. By reducing the 'copy-paste' effect, the model can create more realistic and expressive images, allowing users to generate images of themselves or others in a wider range of scenarios and with more natural appearances. This has implications for things like creating personalized avatars, generating content for games, or even artistic expression.
Abstract
Identity-consistent generation has become an important focus in text-to-image research, with recent models achieving notable success in producing images aligned with a reference identity. Yet, the scarcity of large-scale paired datasets containing multiple images of the same individual forces most approaches to adopt reconstruction-based training. This reliance often leads to a failure mode we term copy-paste, where the model directly replicates the reference face rather than preserving identity across natural variations in pose, expression, or lighting. Such over-similarity undermines controllability and limits the expressive power of generation. To address these limitations, we (1) construct a large-scale paired dataset MultiID-2M, tailored for multi-person scenarios, providing diverse references for each identity; (2) introduce a benchmark that quantifies both copy-paste artifacts and the trade-off between identity fidelity and variation; and (3) propose a novel training paradigm with a contrastive identity loss that leverages paired data to balance fidelity with diversity. These contributions culminate in WithAnyone, a diffusion-based model that effectively mitigates copy-paste while preserving high identity similarity. Extensive qualitative and quantitative experiments demonstrate that WithAnyone significantly reduces copy-paste artifacts, improves controllability over pose and expression, and maintains strong perceptual quality. User studies further validate that our method achieves high identity fidelity while enabling expressive controllable generation.