Native-Resolution Image Synthesis (NiT)


NiT introduces three architectural innovations: Dynamic Tokenization, Variable-Length Sequence Processing, and 2D Structural Prior Injection. Dynamic Tokenization converts each image at its native resolution into a variable-length token sequence, which avoids input padding and the computation it wastes. Variable-Length Sequence Processing uses Flash Attention to process token sequences of differing lengths, and 2D Structural Prior Injection applies an axial 2D Rotary Positional Embedding that factorizes positional encoding into separate height and width components. Together, these let NiT process images of varying resolutions and aspect ratios efficiently.
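To make the first two ideas concrete, here is a minimal sketch of packing inputs of different sizes into one flat token sequence without padding. It is an illustration under stated assumptions, not NiT's actual code: the patch size of 2, the example sizes, and the names token_grid, pack, and padded_tokens are all hypothetical. The cumulative boundaries it builds (cu_seqlens) follow the general convention that variable-length attention kernels use to tell packed sequences apart.

```python
from itertools import accumulate


def token_grid(height, width, patch=2):
    """Return the (rows, cols) patch grid for one input (patch size is assumed)."""
    if height % patch or width % patch:
        raise ValueError("height and width must be divisible by the patch size")
    return height // patch, width // patch


def pack(sizes, patch=2):
    """Pack differently sized inputs into one flat token sequence.

    Returns per-sample token counts, cumulative sequence boundaries, and the
    (row, col) grid position of every token in packed order.
    """
    grids = [token_grid(h, w, patch) for h, w in sizes]
    lengths = [rows * cols for rows, cols in grids]
    cu_seqlens = [0] + list(accumulate(lengths))
    positions = [
        (i, j) for rows, cols in grids for i in range(rows) for j in range(cols)
    ]
    return lengths, cu_seqlens, positions


def padded_tokens(lengths):
    """Token count if every sample were padded to the longest one."""
    return len(lengths) * max(lengths)


# Three hypothetical inputs with different resolutions and aspect ratios.
sizes = [(32, 32), (16, 48), (64, 24)]
lengths, cu_seqlens, positions = pack(sizes)

print("tokens per sample:", lengths)
print("sequence boundaries:", cu_seqlens)
print("packed tokens:", cu_seqlens[-1], "vs padded:", padded_tokens(lengths))
```

Sample k owns the packed tokens from cu_seqlens[k] up to cu_seqlens[k + 1], so attention can be restricted to each sample without any padding tokens. The recorded (row, col) positions are what a 2D positional embedding would consume.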


NiT reports state-of-the-art performance on both the ImageNet 256x256 and 512x512 benchmarks, with FID scores of 2.03 and 1.45, respectively. It also shows strong zero-shot generalization, reaching an FID score of 4.52 at an unseen 1024x1024 resolution, and it outperforms baselines on both resolution generalization and aspect ratio generalization. These results suggest NiT is well suited to applications that need high-quality images across diverse resolutions and aspect ratios, such as image synthesis, image editing, and other computer vision tasks.

Key Features

Native-resolution image synthesis
Arbitrary resolution and aspect ratio generation
Dynamic Tokenization
Variable-Length Sequence Processing
2D Structural Prior Injection
Flash Attention
State-of-the-art performance on ImageNet benchmarks
Strong zero-shot generalization ability
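
As a rough illustration of the 2D Structural Prior Injection feature listed above, the sketch below applies an axial 2D rotary embedding to a single query or key vector. It is a sketch under assumptions rather than NiT's implementation: the even split of channels between rows and columns, the frequency base of 10000, and the names rope_1d, axial_rope_2d, and dot are all hypothetical choices.

```python
import math


def rope_1d(vec, pos, base=10000.0):
    """Rotate consecutive channel pairs of vec by angles proportional to pos."""
    d = len(vec)
    if d % 2:
        raise ValueError("vector length must be even")
    out = []
    for k in range(0, d, 2):
        theta = pos * base ** (-k / d)  # lower frequency for later pairs
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[k], vec[k + 1]
        out += [x * c - y * s, x * s + y * c]
    return out


def axial_rope_2d(vec, row, col, base=10000.0):
    """Encode the row in the first half of the channels, the column in the second."""
    d = len(vec)
    if d % 4:
        raise ValueError("vector length must be divisible by 4")
    half = d // 2
    return rope_1d(vec[:half], row, base) + rope_1d(vec[half:], col, base)


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


# Arbitrary example query and key vectors with 8 channels.
q = [0.5, -1.0, 2.0, 0.25, 1.5, 0.0, -0.75, 1.0]
k = [1.0, 0.5, -0.5, 2.0, 0.0, 1.0, 0.25, -1.5]

# Both pairs of tokens are 2 rows and 3 columns apart.
score_a = dot(axial_rope_2d(q, 3, 5), axial_rope_2d(k, 1, 2))
score_b = dot(axial_rope_2d(q, 13, 25), axial_rope_2d(k, 11, 22))
print(abs(score_a - score_b) < 1e-9)
```

Because height and width are rotated in separate channel groups, the attention score between two tokens depends only on their row offset and column offset, not on absolute position or on the overall size of the grid. That property is one plausible reason an axial scheme carries over to resolutions and aspect ratios not seen in training.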
