GenMask: Adapting DiT for Segmentation via Direct Mask

Yuhuan Yang, Xianwei Zhuang, Yuxuan Cai, Chaofan Ma, Shuai Bai, Jiangchao Yao, Ya Zhang, Junyang Lin, Yanfeng Wang

2026-03-30

Summary

This paper introduces a new approach to image segmentation called GenMask, which trains a generative model to directly create segmentation masks alongside realistic images, rather than adapting a pre-trained model.

What's the problem?

Current methods for segmentation often use pre-trained generative models to extract features, then apply those features to the segmentation task. This is problematic because the way these models represent images doesn't naturally align with the characteristics of segmentation masks: masks are sharp and binary, unlike the gradual variations in natural images. These methods also require complicated intermediate steps to retrieve the features needed for segmentation, making the process less efficient.
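A toy numerical illustration (our own, not from the paper) of why binary masks behave differently from natural images: because a mask only takes the values 0 and 1, thresholding a noisy mask snaps it back to its original values, whereas a smoothly varying signal has no discrete structure to recover.

```python
import numpy as np

rng = np.random.default_rng(0)

mask = (rng.random(10_000) > 0.5).astype(float)  # binary values in {0, 1}
natural = rng.random(10_000)                     # gradual, image-like values

noise = rng.normal(0.0, 0.3, size=10_000)        # additive Gaussian noise

# A corrupted mask is recovered almost perfectly by thresholding at 0.5:
# an entry is wrong only when the noise exceeds 0.5 in magnitude.
mask_recovered = (mask + noise > 0.5).astype(float)
mask_error = np.abs(mask_recovered - mask).mean()

# The gradual signal keeps the full noise as irreducible error.
natural_error = np.abs((natural + noise) - natural).mean()

print(mask_error < natural_error)  # True: binary masks are noise robust
```

This mirrors the abstract's observation that VAE latents of binary masks are "sharply distributed" and "noise robust" compared with natural image latents.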

What's the solution?

The researchers realized that to make this work better, segmentation needs to be trained *within* the generative process itself. The main challenge was that segmentation masks have very different properties in the model's internal representation (the 'latent space') compared to regular images. To fix this, they developed a new way of sampling noise levels during training: the model sees masks mostly at extreme noise levels and images mostly at moderate ones, which helps it learn to represent both in a compatible way. GenMask uses a model called DiT to generate both the black-and-white segmentation masks and the color images at the same time.
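The modality-dependent noise sampling above can be sketched as follows. This is a minimal illustration under our own assumptions (the Beta distributions and the t-in-[0, 1] convention are hypothetical choices, not the paper's exact schedule): mask training steps draw timesteps skewed toward t = 1 (near pure noise), while image training steps draw timesteps concentrated around mid-range noise.

```python
import numpy as np


def sample_timesteps(batch_size, modality, rng):
    """Hypothetical timestep sampler in the spirit of GenMask.

    Timesteps t lie in [0, 1], where t = 1 means pure noise.
    The distributions are illustrative assumptions:
      - masks:  Beta(5, 1), biased toward extreme noise (mean ~0.83)
      - images: Beta(2, 2), centered on moderate noise (mean 0.5)
    """
    if modality == "mask":
        return rng.beta(5.0, 1.0, size=batch_size)
    if modality == "image":
        return rng.beta(2.0, 2.0, size=batch_size)
    raise ValueError(f"unknown modality: {modality!r}")


rng = np.random.default_rng(0)
t_mask = sample_timesteps(10_000, "mask", rng)
t_img = sample_timesteps(10_000, "image", rng)

print(round(t_mask.mean(), 2))  # skews high: masks train at heavy noise
print(round(t_img.mean(), 2))   # centers mid-range for images
```

In joint training, each batch element would draw its timestep from the distribution matching its modality before the usual diffusion loss is applied, so both modalities share one DiT and one generative objective.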

Why it matters?

GenMask simplifies the segmentation process by removing the need for separate feature extraction steps and achieves top-level performance on standard segmentation tasks. This means it's a more efficient and effective way to identify objects and regions within images, which is important for applications like self-driving cars, medical imaging, and image editing.

Abstract

Recent approaches to segmentation have leveraged pretrained generative models as feature extractors, treating segmentation as a downstream adaptation task via indirect feature retrieval. This implicit use suffers from a fundamental misalignment in representation. It also depends heavily on indirect feature extraction pipelines, which complicate the workflow and limit adaptation. In this paper, we argue that instead of indirect adaptation, segmentation should be trained directly in a generative manner. We identify a key obstacle to this unified formulation: VAE latents of binary masks are sharply distributed, noise robust, and linearly separable, unlike natural image latents. To bridge this gap, we introduce a timestep sampling strategy for binary masks that emphasizes extreme noise levels for segmentation and moderate noise for image generation, enabling harmonious joint training. We present GenMask, a DiT trained to generate black-and-white segmentation masks as well as colorful images in RGB space under the original generative objective. GenMask preserves the original DiT architecture while removing the need for feature extraction pipelines tailored to segmentation tasks. Empirically, GenMask attains state-of-the-art performance on referring and reasoning segmentation benchmarks, and ablations quantify the contribution of each component.