Qihoo-T2X: An Efficiency-Focused Diffusion Transformer via Proxy Tokens for Text-to-Any-Task

Jing Wang, Ao Ma, Jiasong Feng, Dawei Leng, Yuhui Yin, Xiaodan Liang

2024-09-09

Summary

This paper introduces Qihoo-T2X, a family of diffusion transformer models that uses proxy tokens to generate images and videos from text descriptions more efficiently.

What's the problem?

Current diffusion transformers rely on global self-attention, in which every token attends to every other token, so the cost grows quadratically with the number of tokens. Because visual information is sparse and redundant (neighboring tokens often carry nearly identical information), much of that computation is wasted, making generation slow and compute-hungry, especially for video.
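
To get a feel for the scale of the waste, here is a back-of-the-envelope comparison in Python. The latent size and window size are made-up illustrative numbers, not values from the paper:

```python
# Hypothetical video latent: 16 frames of 32x32 tokens (illustrative numbers only).
T, H, W = 16, 32, 32
N = T * H * W                                  # total tokens: 16,384

global_pairs = N * N                           # global self-attention: every token attends to every token

win = 4 * 4 * 4                                # one proxy sampled per 4x4x4 spatio-temporal window
P = N // win                                   # number of proxy tokens: 256

# Proxy scheme: self-attention among proxies, plus cross-attention from all tokens to proxies.
proxy_pairs = P * P + N * P

print(f"global attention pairs: {global_pairs:,}")   # 268,435,456
print(f"proxy attention pairs:  {proxy_pairs:,}")    # 4,259,840  (~63x fewer)
```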

What's the solution?

To solve this problem, the authors introduce the Proxy Token Diffusion Transformer (PT-DiT). Instead of attending over all tokens, PT-DiT randomly samples one 'proxy token' from each spatial-temporal window of the input to represent that region. Global semantics are computed by self-attention among these few proxies and then injected back into all tokens through cross-attention, which keeps the essential information while cutting the computation dramatically. Because this sparse attention can miss fine details, the model also uses window and shifted-window attention to restore local detail. Building on PT-DiT, the Qihoo-T2X family includes models for text-to-image (T2I), text-to-video (T2V), and text-to-multi-view (T2MV) generation.
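
The snippet below is a minimal PyTorch sketch of the proxy-token idea as just described: sample one token per window, run self-attention among the proxies, then inject the result into all tokens with cross-attention. All module names, shapes, and the residual wiring are illustrative assumptions, not the authors' PT-DiT code:

```python
import torch
import torch.nn as nn

class ProxyTokenAttention(nn.Module):
    """Illustrative sketch of proxy-token attention (not the official PT-DiT)."""

    def __init__(self, dim: int, num_heads: int, window_size: int):
        super().__init__()
        self.window_size = window_size
        # Self-attention over the sparse proxy tokens captures global semantics.
        self.proxy_self_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Cross-attention injects those semantics back into every latent token.
        self.inject_cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        B, N, C = x.shape                          # (batch, tokens, channels); N must divide by window_size
        windows = x.view(B, N // self.window_size, self.window_size, C)

        # Randomly sample one token per (flattened) spatio-temporal window as its proxy.
        idx = torch.randint(self.window_size, (B, windows.shape[1], 1, 1), device=x.device)
        proxies = torch.gather(windows, 2, idx.expand(-1, -1, -1, C)).squeeze(2)

        # Global information flows among the (much fewer) proxy tokens...
        proxies, _ = self.proxy_self_attn(proxies, proxies, proxies)
        # ...and is then distributed to all latent tokens via cross-attention.
        out, _ = self.inject_cross_attn(x, proxies, proxies)
        return x + out                             # residual connection

x = torch.randn(2, 1024, 256)                      # 2 samples, 1024 tokens, 256 channels
block = ProxyTokenAttention(dim=256, num_heads=8, window_size=64)
print(block(x).shape)                              # torch.Size([2, 1024, 256])
```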

Why it matters?

This research is important because it makes image and video generation faster and less resource-intensive. By improving the efficiency of these models, more people can use them on devices with limited computing power, making advanced AI tools more accessible for creative projects.

Abstract

The global self-attention mechanism in diffusion transformers involves redundant computation due to the sparse and redundant nature of visual information, and the attention map of tokens within a spatial window shows significant similarity. To address this redundancy, we propose the Proxy Token Diffusion Transformer (PT-DiT), which employs sparse representative token attention (where the number of representative tokens is much smaller than the total number of tokens) to model global visual information efficiently. Specifically, in each transformer block, we randomly sample one token from each spatial-temporal window to serve as a proxy token for that region. The global semantics are captured through the self-attention of these proxy tokens and then injected into all latent tokens via cross-attention. Simultaneously, we introduce window and shift window attention to address the limitations in detail modeling caused by the sparse attention mechanism. Building on the well-designed PT-DiT, we further develop the Qihoo-T2X family, which includes a variety of models for T2I, T2V, and T2MV tasks. Experimental results show that PT-DiT achieves competitive performance while reducing the computational complexity in both image and video generation tasks (e.g., a 48% reduction compared to DiT and a 35% reduction compared to Pixart-alpha). Our source code is available at https://github.com/360CVGroup/Qihoo-T2X.
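
As a companion to the sketch above, the window and shift-window attention the abstract mentions for detail modeling can be illustrated with a simple function. This version works on a 1D token sequence with a cyclic shift in the style of Swin Transformer; the real model operates on spatio-temporal windows with learned projections and boundary masking, all omitted here for brevity:

```python
import torch

def windowed_attention(x: torch.Tensor, window: int, shift: int = 0) -> torch.Tensor:
    """Plain dot-product attention within fixed windows of a token sequence.
    A nonzero shift cyclically rolls the sequence so that successive blocks
    see different window boundaries (Swin-style; boundary masking omitted)."""
    B, N, C = x.shape                              # N must be divisible by window
    if shift:
        x = torch.roll(x, -shift, dims=1)
    w = x.view(B, N // window, window, C)
    scores = w @ w.transpose(-2, -1) / C ** 0.5    # attention only within each window
    out = (torch.softmax(scores, dim=-1) @ w).view(B, N, C)
    if shift:
        out = torch.roll(out, shift, dims=1)       # undo the cyclic shift
    return out

x = torch.randn(1, 64, 32)
y = windowed_attention(x, window=16)               # regular windows
z = windowed_attention(x, window=16, shift=8)      # shifted windows let detail cross boundaries
print(y.shape, z.shape)                            # torch.Size([1, 64, 32]) for both
```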