Democratizing Text-to-Image Masked Generative Models with Compact Text-Aware One-Dimensional Tokens

Dongwon Kim, Ju He, Qihang Yu, Chenglin Yang, Xiaohui Shen, Suha Kwak, Liang-Chieh Chen

2025-01-15

Summary

This paper introduces a new way to make AI that turns text into images more accessible and easier to build. The researchers created a tool called TA-TiTok that breaks images down into compact, one-dimensional pieces, and another tool called MaskGen that uses these pieces to create new images from text descriptions.

What's the problem?

Current AI systems that turn text into images are really hard to build and train, and they often rely on private data that other researchers can't access. This means only big tech companies with lots of resources can create these powerful AI tools, leaving many researchers and smaller organizations unable to work on or improve this technology.

What's the solution?

The researchers developed TA-TiTok, a clever way to break down images into simple, one-dimensional pieces that are easier for computers to work with. They also fed text information into the step that turns these pieces back into images, which helps the tokenizer train faster and perform better. Then they created MaskGen, which uses these simple image pieces to make new images from text descriptions. The cool part is that they trained MaskGen using only publicly available data, so anyone can recreate and improve upon their work.
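To give a feel for how a masked generative model like MaskGen produces an image, here is a minimal sketch of the general masked-generation idea: start with every image token masked, then over a few steps let a predictor fill in the most confident tokens until none remain. This is not the authors' code; the predictor here is a random stand-in, and all names, sizes, and the codebook of 1024 tokens are illustrative assumptions.

```python
import numpy as np

MASK = -1  # sentinel id for a still-masked token position

def predict_tokens(tokens, text_embedding, rng):
    """Stand-in for the real transformer: for every position, return a
    token id and a confidence score. Here both are just random."""
    n = len(tokens)
    ids = rng.integers(0, 1024, size=n)   # fake codebook of 1024 tokens
    conf = rng.random(n)
    return ids, conf

def masked_generate(num_tokens=32, steps=8, seed=0):
    """Iterative masked generation: begin fully masked and, at each step,
    commit the most confident predictions until nothing is masked."""
    rng = np.random.default_rng(seed)
    tokens = np.full(num_tokens, MASK)
    text_embedding = rng.random(16)       # stand-in for text conditioning
    for step in range(steps):
        ids, conf = predict_tokens(tokens, text_embedding, rng)
        conf[tokens != MASK] = -np.inf    # never overwrite committed tokens
        # unmask an equal share of the remaining tokens at each step
        remaining = int(np.sum(tokens == MASK))
        k = max(1, remaining // (steps - step))
        keep = np.argsort(conf)[-k:]
        tokens[keep] = ids[keep]
        if not np.any(tokens == MASK):
            break
    return tokens
```

Because several tokens are committed in parallel at every step, this kind of model needs far fewer passes than generating one token at a time, which is part of why masked generative models are fast.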

Why it matters?

This matters because it makes advanced AI image creation technology available to more people, not just big tech companies. By using public data and sharing their tools openly, the researchers are helping to 'democratize' this field, which means more people can work on improving it. This could lead to new and creative uses for text-to-image AI that we haven't even thought of yet, and it helps ensure that this powerful technology isn't controlled by just a few companies.

Abstract

Image tokenizers form the foundation of modern text-to-image generative models but are notoriously difficult to train. Furthermore, most existing text-to-image models rely on large-scale, high-quality private datasets, making them challenging to replicate. In this work, we introduce Text-Aware Transformer-based 1-Dimensional Tokenizer (TA-TiTok), an efficient and powerful image tokenizer that can utilize either discrete or continuous 1-dimensional tokens. TA-TiTok uniquely integrates textual information during the tokenizer decoding stage (i.e., de-tokenization), accelerating convergence and enhancing performance. TA-TiTok also benefits from a simplified, yet effective, one-stage training process, eliminating the need for the complex two-stage distillation used in previous 1-dimensional tokenizers. This design allows for seamless scalability to large datasets. Building on this, we introduce a family of text-to-image Masked Generative Models (MaskGen), trained exclusively on open data while achieving comparable performance to models trained on private data. We aim to release both the efficient, strong TA-TiTok tokenizers and the open-data, open-weight MaskGen models to promote broader access and democratize the field of text-to-image masked generative models.