Effective and Efficient Masked Image Generation Models
Zebin You, Jingyang Ou, Xiaolu Zhang, Jun Hu, Jun Zhou, Chongxuan Li
2025-03-11
Summary
This paper introduces eMIGM, an AI model that creates images by filling in masked (missing) regions efficiently, like solving a puzzle faster and more accurately than older methods.
What's the problem?
Current AI image models either take too long to generate high-quality pictures or struggle to balance speed and accuracy when filling in missing areas.
What's the solution?
eMIGM unifies ideas from masked image generation and masked diffusion models, systematically searches for the best training and sampling settings, and generates detailed images in fewer steps.
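To make the "fewer steps" idea concrete, here is a minimal, purely illustrative sketch of the iterative masked-generation loop that models in this family use: start with every image token masked, then at each step predict all masked tokens in parallel and permanently reveal only the most confident ones. The function name, the linear reveal schedule, and the random stand-in for the model's predictions are all assumptions for illustration, not the paper's actual algorithm.

```python
import random

MASK = None  # sentinel marking a still-masked token position

def masked_generation(num_tokens=16, vocab_size=8, num_steps=4, seed=0):
    """Toy masked-generation sampler (illustrative only).

    A real model would replace the random (confidence, token) pairs
    below with a learned network's predictions; the structure of the
    loop -- predict in parallel, reveal the most confident tokens,
    repeat for a small number of steps -- is what makes this style of
    sampling fast compared with generating one token at a time.
    """
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens  # everything starts masked
    for step in range(1, num_steps + 1):
        # Stand-in for model output: (confidence, predicted token) per position.
        preds = [(rng.random(), rng.randrange(vocab_size))
                 for _ in range(num_tokens)]
        # Linear schedule: after step k, k/num_steps of all tokens are revealed.
        target_revealed = (num_tokens * step) // num_steps
        masked_idx = [i for i, t in enumerate(tokens) if t is MASK]
        to_reveal = target_revealed - (num_tokens - len(masked_idx))
        # Reveal the most confident masked positions; the rest stay
        # masked and get re-predicted next step.
        masked_idx.sort(key=lambda i: preds[i][0], reverse=True)
        for i in masked_idx[:to_reveal]:
            tokens[i] = preds[i][1]
    return tokens
```

With `num_steps=4` and 16 tokens, all positions are filled after just four parallel steps; each step here corresponds to one "function evaluation" (NFE) in the abstract's efficiency comparisons.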
Why does it matter?
This lets artists and designers create high-quality images (like photos or designs) faster and with less computing power, making AI tools more practical for everyday use.
Abstract
Although masked image generation models and masked diffusion models are designed with different motivations and objectives, we observe that they can be unified within a single framework. Building upon this insight, we carefully explore the design space of training and sampling, identifying key factors that contribute to both performance and efficiency. Based on the improvements observed during this exploration, we develop our model, referred to as eMIGM. Empirically, eMIGM demonstrates strong performance on ImageNet generation, as measured by Fréchet Inception Distance (FID). In particular, on ImageNet 256x256, with a similar number of function evaluations (NFEs) and model parameters, eMIGM outperforms the seminal VAR. Moreover, as NFE and model parameters increase, eMIGM achieves performance comparable to the state-of-the-art continuous diffusion models while requiring less than 40% of the NFE. Additionally, on ImageNet 512x512, with only about 60% of the NFE, eMIGM outperforms the state-of-the-art continuous diffusion models.