TC-AE: Unlocking Token Capacity for Deep Compression Autoencoders
Teng Li, Ziyuan Huang, Cong Chen, Yangfu Li, Yuanhuiyi Lyu, Dandan Zheng, Chunhua Shen, Jun Zhang
2026-04-09
Summary
This paper introduces a new way to compress images using a neural network architecture called a Vision Transformer (ViT), specifically designed to avoid losing important details during the compression process.
What's the problem?
When you heavily compress an image, existing methods often respond to the resulting quality loss by making the 'hidden code' representing the image (called the latent representation) bigger and more complex. This quick fix backfires: the overgrown code becomes disorganized, a failure known as latent representation collapse, which makes it hard to recreate a realistic image from the code and hurts image generation built on top of it.
What's the solution?
The researchers tackled this problem by focusing on how the image is broken down into smaller pieces, called tokens, before being compressed. They realized that aggressively compressing these tokens into the hidden code was the main issue. So, they split the compression process into two steps to preserve more of the image's structure. They also improved the tokens themselves by training the system to understand the meaning of different parts of the image, making the hidden code more useful for recreating the image later.
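The benefit of splitting the compression into two steps can be made concrete with a little shape arithmetic. The sketch below is purely illustrative: the token counts, widths, and latent budget are assumed values chosen for the example, not TC-AE's actual configuration. It shows how one aggressive token-to-latent projection can be decomposed into two much milder stages that multiply to the same total compression.

```python
# Illustrative shape arithmetic for splitting token-to-latent compression
# into two milder stages. All numbers are hypothetical assumptions, not
# TC-AE's actual configuration.

def ratio(in_tokens, in_dim, out_tokens, out_dim):
    """Compression ratio: total elements in divided by total elements out."""
    return (in_tokens * in_dim) / (out_tokens * out_dim)

# Assumed setup: 1024 ViT tokens of width 768, compressed to a latent of
# 64 tokens of width 16 (a "deep compression" budget).
tokens, dim = 1024, 768
lat_tokens, lat_dim = 64, 16

# One-shot compression: a single, very aggressive projection.
one_shot = ratio(tokens, dim, lat_tokens, lat_dim)    # 768x overall

# Two-stage compression: first shrink the token count (keeping the width),
# then project the width down to the latent dimension.
stage1 = ratio(tokens, dim, lat_tokens, dim)          # 16x (token merging)
stage2 = ratio(lat_tokens, dim, lat_tokens, lat_dim)  # 48x (channel projection)

# The two stages compose to the same total budget, but each step discards
# far less structure at once than the one-shot projection.
assert abs(stage1 * stage2 - one_shot) < 1e-9
```

Under these assumed sizes, neither stage alone is nearly as lossy as the single 768x projection, which is the intuition behind reducing structural information loss by decomposing the compression.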
Why it matters?
This research is important because it provides a better way to compress images without sacrificing quality, especially when you need to compress them a lot. It improves ViT-based image compression and could lead to advancements in generating realistic images from compressed data, which has applications in things like image storage, transmission, and creating new images.
Abstract
We propose TC-AE, a ViT-based architecture for deep compression autoencoders. Existing methods commonly increase the channel number of latent representations to maintain reconstruction quality under high compression ratios. However, this strategy often leads to latent representation collapse, which degrades generative performance. Instead of relying on increasingly complex architectures or multi-stage training schemes, TC-AE addresses this challenge from the perspective of the token space, the key bridge between pixels and image latents, through two complementary innovations. First, we study token number scaling by adjusting the patch size in ViT under a fixed latent budget, and identify aggressive token-to-latent compression as the key factor that limits effective scaling. To address this issue, we decompose token-to-latent compression into two stages, reducing structural information loss and enabling effective token number scaling for generation. Second, to further mitigate latent representation collapse, we enhance the semantic structure of image tokens via joint self-supervised training, leading to more generative-friendly latents. With these designs, TC-AE achieves substantially improved reconstruction and generative performance under deep compression. We hope our research will advance ViT-based tokenizers for visual generation.
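The token number scaling the abstract mentions follows directly from how a ViT patchifies an image: halving the patch size quadruples the token count, so under a fixed latent budget the token-to-latent compression the autoencoder must perform also grows 4x. The numbers below are assumed for illustration (image size, token width, and latent budget are not taken from the paper).

```python
# Hypothetical illustration of token number scaling under a fixed latent
# budget. Halving the ViT patch size quadruples the number of tokens, so
# the token-to-latent compression ratio grows 4x each time. The image
# size, token width, and latent budget are assumed values.

image_size = 256
token_dim = 768          # assumed ViT token width
latent_budget = 64 * 16  # assumed fixed latent size (total elements)

for patch in (32, 16, 8):
    n_tokens = (image_size // patch) ** 2
    compression = n_tokens * token_dim / latent_budget
    print(f"patch={patch:2d}  tokens={n_tokens:4d}  "
          f"token->latent ratio={compression:.0f}x")
```

At the smallest assumed patch size the direct token-to-latent step becomes extremely aggressive, which matches the abstract's diagnosis that this compression is what limits effective token number scaling.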