MaskRIS: Semantic Distortion-aware Data Augmentation for Referring Image Segmentation

Minhyun Lee, Seungho Lee, Song Park, Dongyoon Han, Byeongho Heo, Hyunjung Shim

2024-12-04


Summary

This paper presents MaskRIS, a new method for improving Referring Image Segmentation (RIS) by using a technique called semantic distortion-aware data augmentation.

What's the problem?

Referring Image Segmentation (RIS) is a task in which a model must identify and segment the object in an image that a free-form text description refers to. Conventional image augmentations tend to hurt RIS rather than help. They can distort the image so that it no longer matches the description (for example, a horizontal flip can invalidate an expression such as "the person on the left"), and the paper observes that this leads to performance degradation.

What's the solution?

To address this issue, the researchers developed MaskRIS, a training framework that randomly masks parts of both the image and the text and then applies a technique called Distortion-aware Contextual Learning (DCL) to fully exploit the benefits of the masking. Training on masked inputs is intended to make the model more robust to occlusions, incomplete information, and complex language descriptions. In the experiments, MaskRIS could be applied to various RIS models and outperformed existing methods in both fully supervised and weakly supervised settings, reaching a new state of the art on the RefCOCO, RefCOCO+, and RefCOCOg datasets.
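To make the idea of masking both inputs concrete, here is a minimal sketch in plain Python. It is not the authors' implementation (see the linked repository for that). The function names, the patch size, the mask ratios, the zero fill value, and the "[MASK]" placeholder token are all assumptions chosen for illustration, and the image is simplified to a 2D grid of numbers.

```python
import random


def mask_image_patches(image, patch_size, mask_ratio, rng, fill=0):
    """Return a copy of a 2D image (list of rows) with random patches filled.

    Hypothetical helper: patch_size, mask_ratio, and fill are illustrative
    assumptions, not values taken from the paper.
    """
    height, width = len(image), len(image[0])
    # Top-left corners of all patches on a regular, non-overlapping grid.
    corners = [
        (r, c)
        for r in range(0, height, patch_size)
        for c in range(0, width, patch_size)
    ]
    num_masked = int(len(corners) * mask_ratio)
    masked = [row[:] for row in image]  # copy so the original stays intact
    for r0, c0 in rng.sample(corners, num_masked):
        for r in range(r0, min(r0 + patch_size, height)):
            for c in range(c0, min(c0 + patch_size, width)):
                masked[r][c] = fill
    return masked


def mask_text_tokens(tokens, mask_ratio, rng, mask_token="[MASK]"):
    """Return a copy of the token list with random tokens replaced.

    Hypothetical helper: the "[MASK]" placeholder is an assumption.
    """
    num_masked = int(len(tokens) * mask_ratio)
    positions = set(rng.sample(range(len(tokens)), num_masked))
    return [mask_token if i in positions else tok for i, tok in enumerate(tokens)]


rng = random.Random(0)  # fixed seed so the sketch is repeatable
image = [[1] * 8 for _ in range(8)]  # toy 8x8 "image" of ones
masked_image = mask_image_patches(image, patch_size=2, mask_ratio=0.5, rng=rng)
masked_tokens = mask_text_tokens("the man in the red shirt".split(), 0.5, rng)
```

In this sketch the segmentation target is left untouched: only the inputs are degraded, so a model trained this way would have to infer the referred object from the remaining visual and textual context.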

Why it matters?

This research is important because it enhances how AI systems can understand and segment images based on text descriptions, making them more effective in real-world applications like robotics, augmented reality, and image editing. By improving the accuracy of RIS, MaskRIS can lead to better user experiences in various fields that rely on precise image analysis.

Abstract

Referring Image Segmentation (RIS) is an advanced vision-language task that involves identifying and segmenting objects within an image as described by free-form text descriptions. While previous studies focused on aligning visual and language features, training techniques, such as data augmentation, remain underexplored. In this work, we explore effective data augmentation for RIS and propose a novel training framework called Masked Referring Image Segmentation (MaskRIS). We observe that the conventional image augmentations fall short of RIS, leading to performance degradation, while simple random masking significantly enhances the performance of RIS. MaskRIS uses both image and text masking, followed by Distortion-aware Contextual Learning (DCL) to fully exploit the benefits of the masking strategy. This approach can improve the model's robustness to occlusions, incomplete information, and various linguistic complexities, resulting in a significant performance improvement. Experiments demonstrate that MaskRIS can easily be applied to various RIS models, outperforming existing methods in both fully supervised and weakly supervised settings. Finally, MaskRIS achieves new state-of-the-art performance on RefCOCO, RefCOCO+, and RefCOCOg datasets. Code is available at https://github.com/naver-ai/maskris.
