AnyStory: Towards Unified Single and Multiple Subject Personalization in Text-to-Image Generation
Junjie He, Yuxiang Tuo, Binghui Chen, Chongyang Zhong, Yifeng Geng, Liefeng Bo
2025-01-17
Summary
This paper introduces AnyStory, a new AI system that can create personalized images based on text descriptions. It's like having a super-smart digital artist that can draw exactly what you describe, including specific people or objects, even when you want multiple specific things in one picture.
What's the problem?
Current AI systems are really good at creating images from text descriptions, but they struggle when you want to include specific people or objects, especially when you want more than one specific thing in the same image. It's like asking an artist to draw your friends in different scenarios - the AI might get the scenario right, but struggle to make the people look exactly like your friends.
What's the solution?
The researchers created AnyStory, which works in two steps. First, it uses special AI tools (called ReferenceNet and CLIP) to understand and remember what specific people or objects look like. Then, it uses another tool they call a 'subject router' to figure out where these specific things should go in the image. This two-step process helps AnyStory create images that not only match the text description but also include accurate representations of specific people or objects, even when there are multiple subjects in one image.
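The two-step "encode-then-route" idea can be sketched roughly as follows. Note that in the real system, ReferenceNet, the CLIP vision encoder, and the subject router are large learned networks; the `encode_subject` and `route_subjects` functions below are toy illustrative stand-ins on random arrays, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def encode_subject(ref_image):
    # Toy stand-in for the ReferenceNet + CLIP encoding step:
    # collapse a reference image into a per-channel feature vector.
    return ref_image.mean(axis=(0, 1))  # shape (C,)

def route_subjects(latent, subject_feats):
    # Toy "subject router": score each spatial location of the latent
    # against each subject feature and assign it to the best match.
    scores = np.einsum("hwc,sc->hws", latent, np.stack(subject_feats))
    return scores.argmax(axis=-1)  # (H, W) map of subject indices

def inject_subjects(latent, subject_feats, routing_map, strength=0.5):
    # Blend each subject's features into the latent only inside its routed
    # region, so multiple subjects do not overwrite one another.
    out = latent.copy()
    for idx, feat in enumerate(subject_feats):
        mask = routing_map == idx
        out[mask] = (1 - strength) * out[mask] + strength * feat
    return out

# Two reference "images" and a latent grid (toy sizes).
refs = [rng.normal(size=(8, 8, 4)) for _ in range(2)]
latent = rng.normal(size=(16, 16, 4))

feats = [encode_subject(r) for r in refs]
routing = route_subjects(latent, feats)
conditioned = inject_subjects(latent, feats, routing)
print(conditioned.shape)  # (16, 16, 4)
```

The key design choice mirrored here is the separation of concerns: encoding decides *what* each subject looks like, while routing decides *where* in the image each subject's features are applied.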
Why it matters?
This matters because it could change how we create and use images in many fields. Imagine being able to easily create illustrations for stories with specific characters, or design marketing materials with exact products and people without needing a photoshoot. It could help in fields like education, where teachers could generate custom visual aids, or in entertainment, where creators could quickly visualize scenes or concepts. This technology brings us one step closer to being able to turn our imaginations into realistic images with just a few words.
Abstract
Recently, large-scale generative models have demonstrated outstanding text-to-image generation capabilities. However, generating high-fidelity personalized images with specific subjects still presents challenges, especially in cases involving multiple subjects. In this paper, we propose AnyStory, a unified approach for personalized subject generation. AnyStory not only achieves high-fidelity personalization for single subjects, but also for multiple subjects, without sacrificing subject fidelity. Specifically, AnyStory models the subject personalization problem in an "encode-then-route" manner. In the encoding step, AnyStory utilizes a universal and powerful image encoder, i.e., ReferenceNet, in conjunction with CLIP vision encoder to achieve high-fidelity encoding of subject features. In the routing step, AnyStory utilizes a decoupled instance-aware subject router to accurately perceive and predict the potential location of the corresponding subject in the latent space, and guide the injection of subject conditions. Detailed experimental results demonstrate the excellent performance of our method in retaining subject details, aligning text descriptions, and personalizing for multiple subjects. The project page is at https://aigcdesigngroup.github.io/AnyStory/.
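The routing step described in the abstract can be pictured as a masked cross-attention: each latent position is only allowed to draw conditions from the subject whose region it was routed to. The sketch below is a simplified illustration of that masking idea on random arrays; the shapes, the argmax-style routing mask, and the single-head attention are assumptions for clarity, not the paper's actual decoupled instance-aware router.

```python
import numpy as np

rng = np.random.default_rng(1)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def masked_subject_attention(latent_tokens, subject_tokens, routing_mask):
    """Toy masked cross-attention for subject-condition injection.

    latent_tokens:  (N, C) flattened latent positions
    subject_tokens: (S, T, C) T feature tokens per subject
    routing_mask:   (N,) subject index assigned to each latent position
    """
    N, C = latent_tokens.shape
    S, T, _ = subject_tokens.shape
    keys = subject_tokens.reshape(S * T, C)
    logits = latent_tokens @ keys.T / np.sqrt(C)  # (N, S*T)
    # The router's role, simplified: block attention to tokens
    # belonging to any subject other than the routed one.
    token_owner = np.repeat(np.arange(S), T)      # (S*T,)
    blocked = token_owner[None, :] != routing_mask[:, None]
    logits[blocked] = -1e9
    attn = softmax(logits, axis=-1)
    return attn @ keys  # (N, C) injected subject conditions

latent_tokens = rng.normal(size=(16, 8))
subject_tokens = rng.normal(size=(2, 4, 8))
routing_mask = rng.integers(0, 2, size=16)
out = masked_subject_attention(latent_tokens, subject_tokens, routing_mask)
print(out.shape)  # (16, 8)
```

Because the mask zeroes out cross-subject attention, each region receives conditions from exactly one subject, which is what keeps multiple personalized subjects from blending into each other.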