SAMTok: Representing Any Mask with Two Words
Yikang Zhou, Tao Zhang, Dengxian Gong, Yuanzheng Wu, Ye Tian, Haochen Wang, Haobo Yuan, Jiacong Wang, Lu Qi, Hao Fei, Anran Wang, Zhuochen Wang, Yujing Wang, Cheng Chen, Shunping Ji, Xiangtai Li
2026-01-23
Summary
This paper introduces a new method, called SAMTok, for giving multimodal large language models (MLLMs) the ability to understand and interact with images at a detailed, pixel-by-pixel level.
What's the problem?
Currently, making MLLMs understand specific parts of an image is really hard. Existing methods require complicated designs for processing image regions, special tools for creating image 'masks' that highlight areas, and different ways of training the model, making it difficult to improve and scale these systems.
What's the solution?
SAMTok solves this by turning image masks (essentially outlines of objects or areas in a picture) into a simple code: every mask becomes just two special 'words'. The model then learns to understand and create masks by predicting these code words, the same way it predicts the next word in text, so the basic structure of the MLLM and its usual training recipe stay unchanged. The authors trained SAMTok on a huge dataset of masks and then used it to upgrade the QwenVL models.
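To make the idea concrete, here is a minimal sketch of what such an interface could look like. The class name, codebook size, token format, and hashing "encoder" below are assumptions for illustration only; the actual SAMTok tokenizer uses a learned mask encoder and decoder built on SAM2.

```python
import numpy as np

# Illustrative sketch only: the class name, codebook size, token format, and
# the hashing "encoder" are assumptions for exposition, not SAMTok's design.
class MaskTokenizerSketch:
    def __init__(self, codebook_size: int = 1024):
        self.codebook_size = codebook_size  # size of each discrete code vocabulary

    def encode(self, mask: np.ndarray) -> tuple[int, int]:
        """Map a binary H x W mask to two discrete code IDs (placeholder logic)."""
        # The real tokenizer uses a learned encoder + vector quantization;
        # here we just hash the mask bits so the example runs end to end.
        bits = np.packbits(mask.astype(np.uint8)).tobytes()
        h = int.from_bytes(bits[:8].ljust(8, b"\0"), "little")
        return h % self.codebook_size, (h // self.codebook_size) % self.codebook_size

    def to_words(self, mask: np.ndarray) -> str:
        """Render a mask as two special 'words' a language model can predict."""
        a, b = self.encode(mask)
        return f"<mask_a_{a}><mask_b_{b}>"

tok = MaskTokenizerSketch()
mask = np.zeros((64, 64), dtype=bool)
mask[16:48, 16:48] = True            # a toy square region
print(tok.to_words(mask))            # the two special tokens for this mask
```

In training, these two tokens are simply appended to the model's vocabulary and embedded in ordinary sentences (for example, a grounded caption that mentions an object followed by its two mask tokens), so the MLLM learns them with standard next-word prediction.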
Why it matters?
This work is important because it provides a much simpler and more effective way to give MLLMs detailed visual understanding. This means these models can perform tasks like accurately describing parts of an image, answering questions about specific objects, and even interacting with images in a more natural way, all without needing complex and hard-to-scale systems.
Abstract
Pixel-wise capabilities are essential for building interactive intelligent systems. However, pixel-wise multi-modal LLMs (MLLMs) remain difficult to scale due to complex region-level encoders, specialized segmentation decoders, and incompatible training objectives. To address these challenges, we present SAMTok, a discrete mask tokenizer that converts any region mask into two special tokens and reconstructs the mask from these tokens with high fidelity. By treating masks as new language tokens, SAMTok enables base MLLMs (such as the QwenVL series) to learn pixel-wise capabilities through standard next-token prediction and simple reinforcement learning, without architectural modifications or specialized loss designs. SAMTok builds on SAM2 and is trained on 209M diverse masks, using a mask encoder and residual vector quantizer to produce discrete, compact, and information-rich tokens. With 5M SAMTok-formatted mask-understanding and mask-generation samples, QwenVL-SAMTok attains state-of-the-art or comparable results on region captioning, region VQA, grounded conversation, referring segmentation, scene graph parsing, and multi-round interactive segmentation. We further introduce a textual answer-matching reward that enables efficient reinforcement learning for mask generation, delivering substantial improvements on GRES and GCG benchmarks. Our results demonstrate a scalable and straightforward paradigm for equipping MLLMs with strong pixel-wise capabilities. Our code and models are available.
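As a rough illustration of the residual vector quantization mentioned in the abstract, the snippet below quantizes a mask embedding into two discrete indices, with the second codebook encoding the residual left by the first. The codebook sizes, embedding dimension, and random embedding are assumptions for illustration; SAMTok's codebooks are learned jointly with its SAM2-based mask encoder rather than drawn at random.

```python
import numpy as np

# Minimal two-stage residual vector quantization (RVQ) sketch. Codebook size,
# embedding dimension, and the random "mask embedding" are assumptions; the
# paper's learned encoder, codebooks, and training losses are not shown.
rng = np.random.default_rng(0)
dim, codebook_size = 64, 1024
codebooks = [rng.standard_normal((codebook_size, dim)) for _ in range(2)]

def rvq_encode(embedding: np.ndarray) -> tuple[list[int], np.ndarray]:
    """Quantize an embedding into two code indices; each stage quantizes the residual."""
    residual, codes, recon = embedding.copy(), [], np.zeros_like(embedding)
    for codebook in codebooks:
        idx = int(np.argmin(np.linalg.norm(codebook - residual, axis=1)))
        codes.append(idx)
        recon += codebook[idx]           # accumulate the reconstruction
        residual = embedding - recon     # next stage sees what is still missing
    return codes, recon

mask_embedding = rng.standard_normal(dim)   # stand-in for a mask encoder output
codes, approx = rvq_encode(mask_embedding)
print(codes)  # two discrete indices -> the "two words" that stand for the mask
```

The two resulting indices are exactly what lets a single mask be written as two vocabulary tokens, which is the property the rest of the pipeline (next-token prediction and the answer-matching reward for reinforcement learning) builds on.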