Directional Textual Inversion for Personalized Text-to-Image Generation
Kunhee Kim, NaHyeon Park, Kibeom Hong, Hyunjung Shim
2025-12-16
Summary
This paper focuses on improving how we personalize images created from text using a technique called Textual Inversion. Essentially, it's about making AI-generated images better reflect a specific style or subject you want, like a particular person or object.
What's the problem?
Current methods for personalization, like Textual Inversion, often struggle when the instructions (prompts) are complicated. The AI learns new 'keywords' to represent your desired subject, but these keywords become too strong and distort how the AI understands the rest of the prompt. Think of it like shouting one word in a sentence: it drowns out everything else. This happens because training changes not just *what* the keyword means (its direction), but also *how strongly* it is emphasized (its magnitude), and that inflated magnitude throws the model's reading of the prompt off.
What's the solution?
The researchers developed a new approach called Directional Textual Inversion (DTI). Instead of letting the AI change both the meaning *and* strength of the new keywords, DTI keeps the strength constant and only adjusts the meaning. It’s like focusing on the direction of a word, not how loudly it’s said. Concretely, the keyword's magnitude is fixed at a typical, in-distribution value, and only its direction is optimized on the unit hypersphere using a geometry-aware optimizer (Riemannian SGD). This also allows for smooth transitions between different personalized concepts, something the original method couldn't do.
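The core geometric idea, optimizing only the direction of a vector while keeping its length fixed, can be sketched in a few lines. This is a minimal illustration of a Riemannian SGD step on the unit hypersphere, not the paper's implementation; the function names and the learning rate are my own.

```python
import numpy as np

def project_to_tangent(v, grad):
    # Remove the radial component of the gradient so the update
    # moves only along the sphere's surface at v (v is unit-norm).
    return grad - np.dot(grad, v) * v

def riemannian_sgd_step(v, grad, lr=0.1):
    # Take a step in the tangent direction, then retract (renormalize)
    # back onto the unit hypersphere so the magnitude never changes.
    v_new = v - lr * project_to_tangent(v, grad)
    return v_new / np.linalg.norm(v_new)
```

In DTI the embedding would be this unit direction scaled by a fixed, in-distribution magnitude, so the norm inflation described above cannot occur by construction.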
Why does it matter?
This work is important because it makes personalized image generation more reliable and controllable. By focusing on the 'direction' of the keywords, the AI can better understand complex prompts and create images that more accurately reflect what you want. Plus, the ability to smoothly blend between different styles opens up new creative possibilities.
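The smooth blending between learned concepts comes from spherical linear interpolation (slerp), which stays on the unit hypersphere at every intermediate point. Here is a standard slerp routine as a hedged sketch; it is the textbook formula, not code from the paper.

```python
import numpy as np

def slerp(u, v, t):
    # Spherical linear interpolation between unit vectors u and v.
    # t = 0 returns u, t = 1 returns v; every point in between is unit-norm.
    omega = np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))
    if np.isclose(omega, 0.0):
        return u  # vectors (nearly) coincide; nothing to interpolate
    so = np.sin(omega)
    return (np.sin((1.0 - t) * omega) / so) * u + (np.sin(t * omega) / so) * v
```

Linear interpolation of two standard TI embeddings would instead pass through points with out-of-distribution magnitudes, which is why this capability is attributed to the hyperspherical parameterization.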
Abstract
Textual Inversion (TI) is an efficient approach to text-to-image personalization but often fails on complex prompts. We trace these failures to embedding norm inflation: learned tokens drift to out-of-distribution magnitudes, degrading prompt conditioning in pre-norm Transformers. Empirically, we show semantics are primarily encoded by direction in CLIP token space, while inflated norms harm contextualization; theoretically, we analyze how large magnitudes attenuate positional information and hinder residual updates in pre-norm blocks. We propose Directional Textual Inversion (DTI), which fixes the embedding magnitude to an in-distribution scale and optimizes only direction on the unit hypersphere via Riemannian SGD. We cast direction learning as MAP with a von Mises-Fisher prior, yielding a constant-direction prior gradient that is simple and efficient to incorporate. Across personalization tasks, DTI improves text fidelity over TI and TI-variants while maintaining subject similarity. Crucially, DTI's hyperspherical parameterization enables smooth, semantically coherent interpolation between learned concepts (slerp), a capability that is absent in standard TI. Our findings suggest that direction-only optimization is a robust and scalable path for prompt-faithful personalization.
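The abstract notes that casting direction learning as MAP with a von Mises-Fisher prior yields a constant-direction prior gradient. The vMF log-density is linear in the direction, log p(v) = kappa * (mu . v) + const, so its gradient is just the constant vector kappa * mu. A hedged sketch of how that prior term could fold into a tangent-space gradient follows; the function name, `mu`, and `kappa` default are illustrative, not from the paper.

```python
import numpy as np

def map_direction_grad(v, task_grad, mu, kappa=1.0):
    # Negative log-posterior gradient for gradient descent:
    # the task loss gradient plus the vMF prior term, whose Euclidean
    # gradient is the constant vector -kappa * mu (pulling v toward mu).
    g = task_grad - kappa * mu
    # Project onto the tangent space of the unit hypersphere at v.
    return g - np.dot(g, v) * v
```

Because the prior contributes only a fixed vector, adding it costs one extra vector subtraction per step, which matches the abstract's claim that it is simple and efficient to incorporate.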