End-to-End Vision Tokenizer Tuning
Wenxuan Wang, Fan Zhang, Yufeng Cui, Haiwen Diao, Zhuoyan Luo, Huchuan Lu, Jing Liu, Xinlong Wang
2025-05-16
Summary
This paper introduces ETT (End-to-End Vision Tokenizer Tuning), a method for training the component of an AI model that converts images into tokens jointly with the rest of the model, so the whole pipeline works better on tasks that involve both images and text.
What's the problem?
When AI models understand or generate images alongside text, they first break each image into discrete pieces through a step called tokenization. This step is typically trained in isolation from the rest of the model, so the tokens it produces are not optimized for the downstream task, which can hurt the model's overall performance.
What's the solution?
The researchers created an end-to-end training method in which the visual tokenizer is optimized jointly with the rest of the model while it learns tasks involving images and text. Because the task loss flows back into the tokenizer, the tokenizer learns to produce representations better suited to the downstream model, leading to improved results in both understanding and generating visuals.
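The core idea, joint optimization versus a frozen tokenizer, can be illustrated with a deliberately tiny toy (hypothetical, not the paper's implementation). Here the "tokenizer" is a linear projection from a 2-D input to one scalar token, and the "model" is a single weight on that token. If the frozen tokenizer happens to discard the input dimension the task depends on, the model can never fit; letting the task gradient flow into the tokenizer fixes this:

```python
def train(data, w_t, tune_tokenizer, steps=500, lr=0.1):
    """Toy joint training: w_t is the 'tokenizer', w_m the downstream 'model'."""
    w_t = list(w_t)
    w_m = 0.1  # small nonzero init so gradients can reach the tokenizer
    for _ in range(steps):
        for x, y in data:
            tok = w_t[0] * x[0] + w_t[1] * x[1]  # tokenize: 2-D input -> 1 token
            err = w_m * tok - y                  # task error (squared loss)
            w_m -= lr * 2 * err * tok            # the model is always trained
            if tune_tokenizer:
                # End-to-end: the task loss gradient flows into the tokenizer
                w_t[0] -= lr * 2 * err * w_m * x[0]
                w_t[1] -= lr * 2 * err * w_m * x[1]
    # Final task loss over the dataset
    return sum((w_m * (w_t[0] * x[0] + w_t[1] * x[1]) - y) ** 2
               for x, y in data)

# The label depends only on x[1], but the frozen tokenizer keeps only x[0]
data = [((1.0, 0.0), 0.0), ((0.0, 1.0), 1.0)]
frozen = train(data, [1.0, 0.0], tune_tokenizer=False)
tuned = train(data, [1.0, 0.0], tune_tokenizer=True)
print(f"frozen tokenizer loss: {frozen:.3f}, tuned tokenizer loss: {tuned:.3f}")
```

The frozen run stays stuck near a loss of 1.0 because the information the task needs never reaches the model, while the end-to-end run drives the loss down by reshaping the tokenizer itself, the same intuition ETT applies at scale to a visual tokenizer inside a multimodal model.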
Why it matters?
This matters because it helps AI systems better combine what they see with what they read and write, which is useful for applications like smart assistants, image search, and creative tools.
Abstract
ETT is an end-to-end vision tokenizer tuning method that integrates visual tokenizer training with autoregressive tasks, significantly improving performance in multimodal understanding and visual generation.