UniWeTok: A Unified Binary Tokenizer with Codebook Size 2^{128} for Unified Multimodal Large Language Models
Shaobin Zhuang, Yuang Ai, Jiaming Han, Weijia Mao, Xiaohui Li, Fangyikang Wang, Xiao Wang, Yan Li, Shanchuan Lin, Kun Xu, Zhenheng Yang, Huaibo Huang, Xiangyu Yue, Hao Chen, Yali Wang
2026-02-17
Summary
This paper introduces a new way to process images for use with advanced AI models called Unified Multimodal Large Language Models (MLLMs). It focuses on creating a better 'visual vocabulary' that allows these models to understand and generate images more effectively.
What's the problem?
Current methods for converting images into a format that AI can understand often struggle to do everything well at once. They might be good at perfectly recreating the image, but bad at understanding *what* is in the image, or good at understanding but not able to generate new, realistic images. Essentially, there's a trade-off between detail, understanding, and the ability to create new content.
What's the solution?
The researchers developed a system called UniWeTok, which uses a very large 'codebook' of visual elements to represent images. They created a new training process called Pre-Post Distillation, together with a 'Generative-Aware Prior', to help the system better understand images and generate new ones. They also designed an architecture that combines convolutional and attention layers, along with a new activation function called SigLu, to improve stability and performance. Finally, they trained the system in stages so it works well across different image sizes and perception-sensitive details like faces and text.
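The paper's implementation is not shown here, but the binary-codebook idea can be sketched concretely: if each image token is represented by 128 bits, the implicit codebook has 2^128 entries without storing any embedding table. The snippet below is a minimal, assumed illustration (a generic lookup-free binary quantizer with a straight-through estimator), not the authors' released code.

```python
import torch
import torch.nn as nn


class BinaryQuantizer(nn.Module):
    """Lookup-free binary quantizer: each token becomes a 128-bit code."""

    def __init__(self, num_bits: int = 128):
        super().__init__()
        self.num_bits = num_bits

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, tokens, num_bits) continuous encoder output.
        z_bin = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
        # Straight-through estimator: the forward pass uses the hard binary code,
        # while gradients flow back through the continuous encoder output.
        return z + (z_bin - z).detach()


quantizer = BinaryQuantizer()
codes = quantizer(torch.randn(2, 256, 128))  # 2 images, 256 tokens, 128 bits each
```

Because the code is just the sign pattern of the encoder output, no nearest-neighbor lookup over 2^128 entries is ever needed, which is what makes such a large effective codebook practical.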
Why it matters?
UniWeTok achieves better image generation quality than previous methods while requiring significantly less computing power for training. It also performs well on a variety of tasks, including understanding images, creating new images, and editing existing ones. This is important because it makes these powerful AI models more accessible and efficient, paving the way for more advanced applications in areas like image editing, content creation, and visual understanding.
Abstract
Unified Multimodal Large Language Models (MLLMs) require a visual representation that simultaneously supports high-fidelity reconstruction, complex semantic extraction, and generative suitability. However, existing visual tokenizers typically struggle to satisfy these conflicting objectives within a single framework. In this paper, we introduce UniWeTok, a unified discrete tokenizer designed to bridge this gap using a massive binary codebook (2^{128}). For the training framework, we introduce Pre-Post Distillation and a Generative-Aware Prior to enhance the semantic extraction and generative prior of the discrete tokens. In terms of model architecture, we propose a convolution-attention hybrid architecture with the SigLu activation function. The SigLu activation not only bounds the encoder output and stabilizes the semantic distillation process but also effectively addresses the optimization conflict between the token entropy loss and the commitment loss. We further propose a three-stage training framework designed to enhance UniWeTok's adaptability across various image resolutions and perception-sensitive scenarios, such as those involving human faces and textual content. On ImageNet, UniWeTok achieves state-of-the-art image generation performance (FID: UniWeTok 1.38 vs. REPA 1.42) while requiring remarkably low training compute (Training Tokens: UniWeTok 33B vs. REPA 262B). On general-domain benchmarks, UniWeTok demonstrates highly competitive capabilities across a broad range of tasks, including multimodal understanding, image generation (DPG Score: UniWeTok 86.63 vs. FLUX.1 [Dev] 83.84), and editing (GEdit Overall Score: UniWeTok 5.09 vs. OmniGen 5.06). We release code and models to facilitate community exploration of unified tokenizers and MLLMs.
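The abstract does not define SigLu or the exact loss formulas, so the sketch below only illustrates the kind of objectives it mentions, following the common lookup-free quantization recipe: an entropy term that encourages all 128 bits to be used and a commitment term that pulls the bounded encoder output toward its hard binary code. All names and the use of `tanh` as a stand-in bounded activation are assumptions, not UniWeTok's actual formulation.

```python
import torch
import torch.nn.functional as F


def binary_token_losses(z: torch.Tensor):
    """Entropy and commitment losses for bounded binary codes (illustrative only)."""
    # z: (batch, tokens, 128), assumed bounded in (-1, 1) by the activation.
    p = (z + 1.0) / 2.0                                  # per-bit probability of being +1
    p_mean = p.mean(dim=(0, 1)).clamp(1e-6, 1 - 1e-6)    # average usage of each bit
    # Negative entropy of average bit usage: minimizing it pushes every bit toward
    # balanced 50/50 usage, so the full 2^128 codebook can be exercised.
    entropy_loss = (p_mean * p_mean.log() + (1 - p_mean) * (1 - p_mean).log()).mean()
    # Commitment loss: keep the continuous output close to its hard binary code.
    z_bin = torch.where(z > 0, torch.ones_like(z), -torch.ones_like(z))
    commit_loss = F.mse_loss(z, z_bin.detach())
    return entropy_loss, commit_loss


ent, com = binary_token_losses(torch.tanh(torch.randn(2, 256, 128)))
```

On this reading, the commitment term pushes each bit toward a confident +1 or -1 while the entropy term pushes bit statistics toward balance, so the two can tug the unbounded encoder output in opposing directions; bounding that output is presumably how SigLu eases the conflict, per the abstract.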