Vision as a Dialect: Unifying Visual Understanding and Generation via Text-Aligned Representations

Jiaming Han, Hao Chen, Yang Zhao, Hanyu Wang, Qi Zhao, Ziyan Yang, Hao He, Xiangyu Yue, Lu Jiang

2025-06-24

Summary

This paper introduces a multimodal AI framework that unifies how machines understand and generate images and text. Its core component is a Text-Aligned Tokenizer (TA-Tok), which converts images into discrete tokens aligned with a language model's vocabulary.

What's the problem?

Previous AI systems represented images and text in separate spaces, which made it hard to connect visual understanding and visual generation smoothly within one model while keeping both efficiency and output quality high.

What's the solution?

The researchers created TA-Tok, which represents images as discrete tokens drawn from a vocabulary shared with language models, so image tokens live in the same space as text tokens. They pair this with two generative de-tokenizers that reconstruct images from those tokens: a fast autoregressive model and a high-fidelity diffusion-based model. They also introduce techniques to balance visual detail against processing efficiency.
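To make the core idea concrete, here is a minimal sketch (not the paper's actual code) of how text-aligned tokenization can work: continuous image features from a vision encoder are snapped to their nearest entries in a language model's token embedding table, so images become sequences of "text-like" discrete ids. All shapes and names here (`vocab_embeddings`, `patch_features`, the 4096-token vocabulary) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical shapes for illustration: a 4096-entry LM vocabulary with
# 64-dim embeddings, and 16 image-patch features from a vision encoder.
vocab_embeddings = rng.normal(size=(4096, 64))   # LM token embedding table
patch_features = rng.normal(size=(16, 64))       # vision-encoder outputs

def text_align_tokenize(features, codebook):
    """Assign each image feature to its nearest text-token embedding."""
    # Squared L2 distance between every feature and every codebook entry;
    # broadcasting gives a (num_patches, vocab_size) distance matrix.
    dists = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    # Each patch becomes a discrete token id shared with the LM vocabulary.
    return dists.argmin(axis=1)

token_ids = text_align_tokenize(patch_features, vocab_embeddings)
print(token_ids.shape)  # one discrete id per image patch
```

Because the resulting ids index the same vocabulary the language model already uses, understanding and generation can both operate on one token stream; a de-tokenizer (autoregressive or diffusion-based, as in the paper) then maps the ids back to pixels.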

Why it matters?

This matters because it lets AI systems combine and switch between visual and language tasks seamlessly, improving applications such as image captioning, text-to-image generation, and multimodal reasoning, while also enabling faster training and better results.

Abstract

A multimodal framework uses a Text-Aligned Tokenizer (TA-Tok) to integrate vision and text into a unified space, employing a generative de-tokenizer with autoregressive and diffusion-based models for efficient and high-fidelity visual outputs.