
TokenVerse: Versatile Multi-concept Personalization in Token Modulation Space

Daniel Garibi, Shahar Yadin, Roni Paiss, Omer Tov, Shiran Zada, Ariel Ephrat, Tomer Michaeli, Inbar Mosseri, Tali Dekel

2025-01-22


Summary

This paper introduces TokenVerse, a new AI method that can learn and combine different visual concepts from images to create new, personalized pictures. It's like teaching a computer to understand and mix different parts of images, such as objects, materials, or lighting, in creative ways.

What's the problem?

Current AI systems for creating images often struggle to separate and recombine different visual elements from multiple pictures. It's like trying to take the smile from one photo, the hairstyle from another, and the background from a third, but not being able to mix them smoothly or naturally. This limits how creative and flexible these AI systems can be when generating new images.

What's the solution?

The researchers created TokenVerse, which uses a clever trick with something called 'token modulation space' in AI image generation models. It's like giving the AI a special set of knobs for each word in a description, allowing it to fine-tune how that word affects the final image. TokenVerse can learn these 'knobs' from just one image and its description, and then use them to create new images that mix and match different concepts in any way you want.
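To make the "knobs" analogy a bit more concrete, here is a minimal PyTorch sketch (not the authors' code) of DiT-style scale-and-shift modulation, where a learned direction added to one word's conditioning vector changes only how that word steers the image features. The class name, the `concept_direction` variable, and all shapes are illustrative assumptions, not the paper's implementation.

```python
# Toy illustration of per-token "knobs" as directions in a modulation space.
import torch
import torch.nn as nn

class ModulatedBlock(nn.Module):
    """Toy DiT-style block: each token's conditioning vector is mapped to a
    scale and a shift that modulate the image features (adaLN-style)."""
    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.to_scale_shift = nn.Linear(dim, 2 * dim)  # produces (scale, shift)

    def forward(self, img_feats: torch.Tensor, cond: torch.Tensor) -> torch.Tensor:
        # img_feats: (batch, tokens, dim), cond: (batch, tokens, dim)
        scale, shift = self.to_scale_shift(cond).chunk(2, dim=-1)
        return self.norm(img_feats) * (1 + scale) + shift

dim = 64
block = ModulatedBlock(dim)
img_feats = torch.randn(1, 16, dim)  # toy image tokens
cond = torch.randn(1, 16, dim)       # toy per-token text conditioning

# A learned "knob" for one word: a direction added to that word's conditioning
# vector, nudging only how that token modulates the image features.
concept_direction = torch.zeros(dim, requires_grad=True)  # hypothetical, learned per word
cond_edited = cond.clone()
cond_edited[:, 3] = cond[:, 3] + concept_direction        # apply to token #3 only

out = block(img_feats, cond_edited)
print(out.shape)  # torch.Size([1, 16, 64])
```

In TokenVerse such directions are found by optimization against the pre-trained model rather than set by hand, which is what lets it learn them from a single image and its caption.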

Why it matters?

This matters because it could make AI image generation much more flexible and personalized. Imagine being able to describe any combination of objects, styles, or scenes, and have an AI create exactly what you're thinking of, even if it's never seen that exact combination before. This could be huge for artists, designers, and anyone who needs to create custom visuals quickly. It's a big step towards AI that can truly understand and manipulate visual concepts the way humans do, opening up new possibilities for creativity and design.

Abstract

We present TokenVerse -- a method for multi-concept personalization, leveraging a pre-trained text-to-image diffusion model. Our framework can disentangle complex visual elements and attributes from as little as a single image, while enabling seamless plug-and-play generation of combinations of concepts extracted from multiple images. As opposed to existing works, TokenVerse can handle multiple images with multiple concepts each, and supports a wide range of concepts, including objects, accessories, materials, pose, and lighting. Our work exploits a DiT-based text-to-image model, in which the input text affects the generation through both attention and modulation (shift and scale). We observe that the modulation space is semantic and enables localized control over complex concepts. Building on this insight, we devise an optimization-based framework that takes as input an image and a text description, and finds for each word a distinct direction in the modulation space. These directions can then be used to generate new images that combine the learned concepts in a desired configuration. We demonstrate the effectiveness of TokenVerse in challenging personalization settings, and showcase its advantages over existing methods. Project webpage: https://token-verse.github.io/
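The abstract describes two steps: learning one direction per word by optimization, and then plugging several learned directions together at generation time. The sketch below illustrates both in toy form; the variable names, the frozen linear layer, and the MSE objective are stand-ins I introduce for illustration, not the paper's actual denoising-based objective or model.

```python
# Hedged sketch: composing per-word directions, plus a toy optimization loop.
import torch

dim = 64
num_tokens = 8
base_modulation = torch.randn(num_tokens, dim)  # per-token modulation vectors from the prompt

# Pretend these directions were found earlier, each from its own source image.
learned_directions = {
    2: torch.randn(dim) * 0.1,  # e.g. a direction for the word "jacket"
    5: torch.randn(dim) * 0.1,  # e.g. a direction for the word "lighting"
}

# Plug-and-play composition: add each concept's direction at its word's slot.
composed = base_modulation.clone()
for token_idx, direction in learned_directions.items():
    composed[token_idx] += direction

# Toy illustration of fitting one direction by optimization: match the output
# of a frozen layer to a target feature (a stand-in for the reconstruction
# objective the paper optimizes on the concept image).
layer = torch.nn.Linear(dim, dim)
for p in layer.parameters():
    p.requires_grad_(False)
target = torch.randn(dim)
direction = torch.zeros(dim, requires_grad=True)
opt = torch.optim.Adam([direction], lr=1e-2)
for _ in range(200):
    opt.zero_grad()
    pred = layer(base_modulation[2] + direction)
    loss = torch.nn.functional.mse_loss(pred, target)
    loss.backward()
    opt.step()
```

The appeal of this setup is that each concept is stored as a small vector tied to one word, so concepts learned from different images can be mixed freely in a new prompt without retraining the underlying model.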