BRAT: Bonus oRthogonAl Token for Architecture Agnostic Textual Inversion

James Baker

2024-08-12

Summary

This paper introduces BRAT, a method that enhances Textual Inversion with bonus tokens and a vision transformer backbone, improving how models learn new subjects and styles from only a few example images.

What's the problem?

Textual Inversion is a popular technique for teaching AI models to generate images of new concepts using only a few example images. However, most research on it has targeted the UNet denoising backbone specifically, and many optimization strategies depend on that architecture's idiosyncratic layers. As a result, it remains underexplored how Textual Inversion behaves with alternative architectures, and whether architecture-agnostic improvements are possible.
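The core training pattern of Textual Inversion can be sketched in a few lines: a single new pseudo-token embedding is the only trainable parameter, while the generative model stays frozen. The sketch below uses lightweight stand-ins (a fixed linear map instead of a real text encoder and diffusion model, and a random reconstruction target), so it illustrates the optimization pattern rather than the paper's actual pipeline; the embedding width and learning rate are assumptions.

```python
import torch

torch.manual_seed(0)
emb_dim = 768  # typical CLIP text-embedding width (assumption)

# The new pseudo-token embedding is the ONLY trainable parameter.
pseudo_token = torch.nn.Parameter(torch.randn(emb_dim) * 0.01)

# Stand-in for the frozen text encoder + denoiser: a fixed linear map.
frozen_model = torch.nn.Linear(emb_dim, emb_dim)
for p in frozen_model.parameters():
    p.requires_grad_(False)

# Stand-in for the reconstruction target derived from the example images.
target = torch.randn(emb_dim)
opt = torch.optim.Adam([pseudo_token], lr=1e-2)

losses = []
for step in range(200):
    opt.zero_grad()
    # Only the pseudo-token receives gradients; the model weights do not move.
    loss = torch.nn.functional.mse_loss(frozen_model(pseudo_token), target)
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

Because the backbone is frozen and only the embedding is optimized, the same loop applies whether the denoiser is a UNet or a vision transformer, which is what makes the technique a candidate for architecture-agnostic extensions.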

What's the solution?

The authors propose BRAT, which adds bonus tokens with an orthogonality constraint to the Textual Inversion process, an optimization strategy that does not depend on the UNet's idiosyncratic layers. They also experiment with a vision transformer in place of the UNet. In their experiments, the bonus token improves how well generated images adhere to the source images, while the vision transformer improves adherence to the prompt, yielding better performance when learning new styles and subjects.
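A minimal sketch of the bonus-token idea: two pseudo-token embeddings are learned jointly, with a penalty that pushes them toward orthogonality so each captures distinct information. The squared-cosine penalty, the stand-in reconstruction loss, and all hyperparameters below are illustrative assumptions, not the authors' implementation (see the linked repository for that).

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
emb_dim = 768  # assumed text-embedding width

# Base pseudo-token plus a "bonus" token, both trainable.
base_token = torch.nn.Parameter(torch.randn(emb_dim) * 0.01)
bonus_token = torch.nn.Parameter(torch.randn(emb_dim) * 0.01)

def orthogonality_penalty(a, b):
    """Squared cosine similarity: zero exactly when a and b are orthogonal."""
    return F.cosine_similarity(a, b, dim=0) ** 2

# Stand-in reconstruction target and task loss (not the real diffusion loss).
recon_target = torch.randn(emb_dim)
opt = torch.optim.Adam([base_token, bonus_token], lr=1e-2)

losses = []
for step in range(300):
    opt.zero_grad()
    # Both tokens jointly reconstruct the target...
    task_loss = F.mse_loss(base_token + bonus_token, recon_target)
    # ...while the penalty keeps their directions decorrelated.
    loss = task_loss + 0.1 * orthogonality_penalty(base_token, bonus_token)
    loss.backward()
    opt.step()
    losses.append(loss.item())
```

Note that the training signal here touches only the token embeddings, so the scheme carries over unchanged between UNet and vision-transformer backbones.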

Why it matters?

This research is important because it opens up new possibilities for personalizing AI models without being limited by specific architectures. By improving how models learn from fewer images, BRAT can help users create more customized and accurate image generations, making AI tools more versatile and effective in various applications.

Abstract

Textual Inversion remains a popular method for personalizing diffusion models, in order to teach models new subjects and styles. We note that textual inversion has been underexplored using alternatives to the UNet, and experiment with textual inversion with a vision transformer. We also seek to optimize textual inversion using a strategy that does not require explicit use of the UNet and its idiosyncratic layers, so we add bonus tokens and enforce orthogonality. We find the use of the bonus token improves adherence to the source images and the use of the vision transformer improves adherence to the prompt. Code is available at https://github.com/jamesBaker361/tex_inv_plus.