NiT introduces three key architectural innovations: Dynamic Tokenization, Variable-Length Sequence Processing, and 2D Structural Prior Injection. Dynamic Tokenization converts each image into a variable-length token sequence, which avoids input padding and the computation that would be wasted on it. Variable-Length Sequence Processing uses FlashAttention to process these heterogeneous token sequences, while 2D Structural Prior Injection introduces an axial 2D Rotary Positional Embedding (RoPE) that factorizes positional encoding into separate height and width components. Together, these innovations enable NiT to efficiently process images of varying resolutions and aspect ratios.
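To make these three ideas concrete, the following is a minimal pure-Python sketch, not NiT's actual implementation. It assumes a patch size of 2, a rotation base of 10000, and a head dimension divisible by 4; the function names (`token_grid`, `pack`, `axial_rope_angles`, `rotate`) are hypothetical and chosen only for illustration. It shows how images of different sizes can be packed into one padding-free sequence with cumulative boundaries, and how an axial 2D RoPE can spend half of the rotation pairs on the row index and half on the column index.

```python
import math

def token_grid(height, width, patch=2):
    """Return the (rows, cols) patch grid for an input of the given size."""
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by patch")
    return height // patch, width // patch

def pack(grids):
    """Concatenate per-image token positions without padding.

    Returns (positions, cu_seqlens). positions is a flat list of (row, col)
    pairs; cu_seqlens holds cumulative sequence boundaries, a layout that
    variable-length attention kernels commonly take.
    """
    positions, cu_seqlens = [], [0]
    for rows, cols in grids:
        positions.extend((r, c) for r in range(rows) for c in range(cols))
        cu_seqlens.append(len(positions))
    return positions, cu_seqlens

def axial_rope_angles(row, col, head_dim=8, base=10000.0):
    """Rotation angles for one token (assumed layout, for illustration).

    The first half of the rotation pairs encodes the row index and the
    second half encodes the column index, so height and width are
    factorized into independent components.
    """
    if head_dim % 4:
        raise ValueError("head_dim must be divisible by 4")
    per_axis = head_dim // 4  # rotation pairs assigned to each axis
    freqs = [base ** (-i / per_axis) for i in range(per_axis)]
    return [row * f for f in freqs] + [col * f for f in freqs]

def rotate(vec, angles):
    """Rotate consecutive pairs (vec[2i], vec[2i+1]) by angles[i]."""
    out = []
    for i, theta in enumerate(angles):
        x, y = vec[2 * i], vec[2 * i + 1]
        out.append(x * math.cos(theta) - y * math.sin(theta))
        out.append(x * math.sin(theta) + y * math.cos(theta))
    return out

# Two inputs with different shapes share one packed sequence.
grids = [token_grid(32, 32), token_grid(16, 48)]
positions, cu_seqlens = pack(grids)
```

In this example the two inputs contribute 256 and 192 tokens, so the packed sequence holds 448 tokens, whereas padding both to the longer length would occupy 512 slots. Because each axis is rotated independently, the dot product between a rotated query and a rotated key depends only on the row and column offsets between the two tokens, not on their absolute positions.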
NiT has demonstrated state-of-the-art performance on both the ImageNet 256x256 and 512x512 benchmarks, achieving FID scores of 2.03 and 1.45, respectively. Moreover, NiT exhibits strong zero-shot generalization, with an FID score of 4.52 at the unseen 1024x1024 resolution. NiT also outperforms baselines on both resolution generalization and aspect ratio generalization, indicating that it can generate high-quality images across diverse resolutions and aspect ratios. These results make NiT a promising tool for applications such as image synthesis and image editing, and for computer vision more broadly.