
Rethinking UMM Visual Generation: Masked Modeling for Efficient Image-Only Pre-training

Peng Sun, Jun Xie, Tao Lin

2026-03-18


Summary

This paper focuses on improving how we train Unified Multimodal Models, which are AI systems that can understand and generate both images and text. The research introduces a new training method called Image-Only Training for UMMs (IOMM).

What's the problem?

Training these models is difficult because the part that generates images usually requires a lot of computing power and a huge amount of paired image-text data, which is expensive and hard to get. Existing methods aren't very efficient and rely too much on this limited paired data, creating a bottleneck in performance.

What's the solution?

The researchers propose a two-stage training process. First, they pre-train the image generation part of the model with a masked-modeling objective, using only a massive collection of images *without* any text. This is much faster and doesn't require paired data. Then, they fine-tune the entire model on a mixture of unlabeled images and a smaller, carefully curated set of image-text pairs, so that it follows instructions well and produces high-quality images. This approach, IOMM, makes training more efficient and improves the model's overall ability.
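To make the two-stage data flow concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the function names, the mask sentinel, and the stage-2 mixing ratio are all illustrative assumptions; only the overall idea (mask image patches for text-free pre-training, then mix unlabeled images with a small paired set for fine-tuning) comes from the paper.

```python
import random

def mask_patches(patches, mask_ratio=0.6, rng=None):
    """Stage 1 (sketch): randomly mask a fraction of image patch tokens.

    The model would be trained to reconstruct the masked positions,
    so no text is needed. Masked slots are replaced by the sentinel
    string "MASK". Returns (masked_patches, masked_indices).
    """
    rng = rng or random.Random(0)
    n_mask = int(len(patches) * mask_ratio)
    idx = sorted(rng.sample(range(len(patches)), n_mask))
    masked = list(patches)
    for i in idx:
        masked[i] = "MASK"
    return masked, idx

def build_stage2_mixture(unlabeled_images, paired_data, pair_fraction=0.2, rng=None):
    """Stage 2 (sketch): build a fine-tuning set that is mostly unlabeled
    images plus a small curated text-image subset.

    `pair_fraction` is a hypothetical ratio, not a value from the paper.
    Each element is tagged with its data type so a training loop could
    route it to the right loss.
    """
    rng = rng or random.Random(0)
    n_pairs = max(1, int(len(unlabeled_images) * pair_fraction))
    pairs = [rng.choice(paired_data) for _ in range(n_pairs)]
    return ([("image_only", img) for img in unlabeled_images]
            + [("text_image", p) for p in pairs])
```

For example, masking 10 patch tokens at a 0.6 ratio leaves 6 positions replaced by the sentinel, and a stage-2 mixture over 5 unlabeled images at `pair_fraction=0.2` adds just one text-image pair, reflecting how little paired data the second stage needs.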

Why it matters?

This research is important because it makes it easier and cheaper to build powerful AI models that can work with both images and text. The IOMM method achieves better results than previous approaches, even with less computing time, and opens the door for more advanced multimodal AI applications. Their model, IOMM-B, demonstrates state-of-the-art performance on key evaluation benchmarks.

Abstract

Unified Multimodal Models (UMMs) are often constrained by the pre-training of their visual generation components, which typically relies on inefficient paradigms and scarce, high-quality text-image paired data. In this paper, we systematically analyze pre-training recipes for UMM visual generation and identify these two issues as the major bottlenecks. To address them, we propose Image-Only Training for UMMs (IOMM), a data-efficient two-stage training framework. The first stage pre-trains the visual generative component exclusively using abundant unlabeled image-only data, thereby removing the dependency on paired data for this costly phase. The second stage fine-tunes the model using a mixture of unlabeled images and a small curated set of text-image pairs, leading to improved instruction alignment and generative quality. Extensive experiments show that IOMM not only improves training efficiency but also achieves state-of-the-art (SOTA) performance. For example, our IOMM-B (3.6B) model was trained from scratch using only ~1050 H800 GPU hours (with the vast majority, 1000 hours, dedicated to the efficient image-only pre-training stage). It achieves 0.89 on GenEval and 0.55 on WISE, surpassing strong baselines such as BAGEL-7B (0.82 & 0.55) and BLIP3-o-4B (0.84 & 0.50). Code is available at https://github.com/LINs-lab/IOMM.