AToken: A Unified Tokenizer for Vision
Jiasen Lu, Liangchen Song, Mingze Xu, Byeongjoo Ahn, Yanjun Wang, Chen Chen, Afshin Dehghan, Yinfei Yang
2025-09-19
Summary
This paper introduces AToken, a new system for processing images, videos, and 3D models in a unified way. It's like creating a single 'language' that computers can use to understand and recreate all these different types of visual information.
What's the problem?
Currently, most systems that can understand images or videos are built for just one type of visual data. If you have a system good at reconstructing images, it won't necessarily be good at understanding what's *in* the image, and vice versa. Also, a system built for images won't work for videos or 3D models without being completely rebuilt. This means a lot of duplicated effort and limits how well AI can handle different visual inputs together.
What's the solution?
The researchers created AToken, which uses a transformer, a type of neural network, to convert all visual data (images, videos, and 3D models) into a common format called a '4D latent space'. Think of it like translating everything into a single code. They also developed a training objective that makes sure the system can both accurately recreate the original visual data *and* understand its content. Crucially, this objective avoids the adversarial (GAN-style) losses that often make such systems unstable to train, relying instead on perceptual and Gram matrix losses. They started by training on still images and gradually added videos and 3D models to the mix.
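To make the training recipe concrete, here is a minimal sketch of an adversarial-free reconstruction objective combining a pixel loss with perceptual and Gram matrix terms, in the spirit of what the paper describes. It assumes a PyTorch setup and a frozen feature extractor supplied by the caller; the function names, loss weights, and extractor interface are illustrative, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def gram_matrix(feat):
    # feat: (B, C, H, W) feature map from a pretrained extractor.
    # Returns the normalized channel-by-channel correlation matrix.
    b, c, h, w = feat.shape
    f = feat.reshape(b, c, h * w)
    return torch.bmm(f, f.transpose(1, 2)) / (c * h * w)

def reconstruction_loss(recon, target, feature_extractor,
                        w_pixel=1.0, w_perc=1.0, w_gram=1.0):
    """Adversarial-free objective: pixel + perceptual + Gram matrix terms.

    `feature_extractor` is assumed to return a list of intermediate
    feature maps (e.g., from a frozen VGG); the weights are illustrative.
    """
    loss = w_pixel * F.l1_loss(recon, target)
    feats_r = feature_extractor(recon)
    feats_t = feature_extractor(target)
    for fr, ft in zip(feats_r, feats_t):
        loss = loss + w_perc * F.l1_loss(fr, ft)            # perceptual term
        loss = loss + w_gram * F.mse_loss(gram_matrix(fr),  # feature-statistics term
                                          gram_matrix(ft))
    return loss
```

The Gram matrix term matches second-order feature statistics (roughly, textures), which is one way to keep reconstructions perceptually sharp without adding a GAN discriminator and its training instabilities.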
Why it matters?
AToken is important because it's a step towards creating more versatile and powerful AI systems. By unifying how visual data is processed, it opens the door to AI that can seamlessly switch between understanding and generating different types of visual content, like turning text into videos or images into 3D models. This could lead to significant advancements in areas like robotics, virtual reality, and content creation.
Abstract
We present AToken, the first unified visual tokenizer that achieves both high-fidelity reconstruction and semantic understanding across images, videos, and 3D assets. Unlike existing tokenizers that specialize in either reconstruction or understanding for single modalities, AToken encodes these diverse visual inputs into a shared 4D latent space, unifying both tasks and modalities in a single framework. Specifically, we introduce a pure transformer architecture with 4D rotary position embeddings to process visual inputs of arbitrary resolutions and temporal durations. To ensure stable training, we introduce an adversarial-free training objective that combines perceptual and Gram matrix losses, achieving state-of-the-art reconstruction quality. By employing a progressive training curriculum, AToken gradually expands from single images to videos and 3D assets, and supports both continuous and discrete latent tokens. AToken achieves 0.21 rFID with 82.2% ImageNet accuracy for images, 3.01 rFVD with 32.6% MSRVTT retrieval for videos, and 28.19 PSNR with 90.9% classification accuracy for 3D. In downstream applications, AToken enables both visual generation tasks (e.g., image generation with continuous and discrete tokens, text-to-video generation, image-to-3D synthesis) and understanding tasks (e.g., multimodal LLMs), achieving competitive performance across all benchmarks. These results shed light on next-generation multimodal AI systems built upon unified visual tokenization.
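The abstract's 4D rotary position embeddings can be pictured as applying ordinary 1D RoPE independently along each of four coordinate axes. The sketch below, assuming PyTorch, splits each attention head's channels into four groups and rotates each group by one coordinate; the axis ordering, the half-split rotation convention, and all function names are assumptions for illustration rather than the paper's exact formulation.

```python
import torch

def rope_1d(x, pos, base=10000.0):
    """Standard 1D rotary embedding over the last dim of `x`.

    x:   (..., d) with d even
    pos: positions along one axis, broadcastable against x's leading dims
    """
    d = x.shape[-1]
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=x.dtype, device=x.device) / half)
    angles = pos[..., None] * freqs            # (..., d/2) rotation angles
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def rope_4d(q, coords):
    """Rotate each quarter of the head dim by one coordinate axis.

    q:      (batch, seq, heads, head_dim), head_dim divisible by 8
    coords: (batch, seq, 4) positions per token, e.g. (t, x, y, z)
    """
    groups = q.chunk(4, dim=-1)                # one channel group per axis
    rotated = [rope_1d(g, coords[..., i].unsqueeze(-1))  # broadcast over heads
               for i, g in enumerate(groups)]
    return torch.cat(rotated, dim=-1)

# Hypothetical usage: 2 sequences of 16 tokens, 8 heads, head_dim 64.
q = torch.randn(2, 16, 8, 64)
coords = torch.randint(0, 10, (2, 16, 4)).float()
q_rot = rope_4d(q, coords)                     # same shape as q
```

Because each token carries explicit (time, space) coordinates rather than a flat index, the same attention mechanism can, in principle, handle inputs of arbitrary resolution and duration, which is the property the abstract highlights.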