OpenVision 3: A Family of Unified Visual Encoder for Both Understanding and Generation
Letian Zhang, Sucheng Ren, Yanqing Liu, Xianhang Li, Zeyu Wang, Yuyin Zhou, Huaxiu Yao, Zeyu Zheng, Weili Nie, Guilin Liu, Zhiding Yu, Cihang Xie
2026-01-23
Summary
This paper introduces OpenVision 3, a family of visual encoders that learn a single representation good at both understanding what's in an image and creating new images from scratch.
What's the problem?
Traditionally, computer vision systems are built either for understanding images (like identifying objects) or for generating them (like creating realistic pictures). Each task usually gets its own separate encoder, so the two sides don't share what they learn and the overall system is less efficient. The goal here is a single encoder that does both well.
What's the solution?
The researchers built an encoder that takes compressed versions of images (latents produced by a VAE) and learns a single representation of them. This encoder is trained with two objectives at the same time. First, its output is decoded back into the original image, forcing the representation to keep the visual detail needed for generation. Second, the same representation is trained to match images with their captions and to tell different images apart, strengthening its grasp of what an image means. Optimizing both together gives the encoder a strong, all-around representation of images.
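To make the two-signal training concrete, here is a minimal PyTorch-style sketch of how one loss could combine the reconstruction, contrastive, and captioning objectives on a shared encoder output. The encoder module, the decoder/text_encoder/captioner callables, the patch size, dimensions, temperature, and loss weights are all illustrative assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn.functional as F
from torch import nn


class UnifiedEncoderSketch(nn.Module):
    """Toy ViT-style encoder over VAE latents; all dimensions are illustrative guesses."""

    def __init__(self, latent_channels=4, dim=256, depth=2, heads=4):
        super().__init__()
        # Patchify the VAE latent grid (hypothetical 2x2 patches on the latent map).
        self.patch_embed = nn.Conv2d(latent_channels, dim, kernel_size=2, stride=2)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)
        self.proj = nn.Linear(dim, dim)  # head mapping pooled tokens into the contrastive space

    def forward(self, vae_latents):
        tokens = self.patch_embed(vae_latents).flatten(2).transpose(1, 2)  # (B, N, D)
        tokens = self.blocks(tokens)
        pooled = self.proj(tokens.mean(dim=1))                             # (B, D)
        return tokens, pooled


def joint_loss(encoder, decoder, text_encoder, captioner,
               vae_latents, images, captions,
               w_rec=1.0, w_con=1.0, w_cap=1.0, temperature=0.07):
    """One training step's loss: reconstruction + contrastive + captioning, all from one encoder."""
    tokens, img_emb = encoder(vae_latents)

    # (1) Generative signal: decode the encoder tokens back to pixels and compare.
    recon = decoder(tokens)                      # `decoder` is a placeholder module
    loss_rec = F.mse_loss(recon, images)

    # (2) Semantic signal, CLIP-style: match each image with its own caption.
    txt_emb = F.normalize(text_encoder(captions), dim=-1)   # placeholder text tower
    img_emb = F.normalize(img_emb, dim=-1)
    logits = img_emb @ txt_emb.t() / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_con = 0.5 * (F.cross_entropy(logits, targets) +
                      F.cross_entropy(logits.t(), targets))

    # (3) Semantic signal, captioning: predict the caption from the visual tokens.
    loss_cap = captioner(tokens, captions)       # placeholder returning a token-level NLL

    return w_rec * loss_rec + w_con * loss_con + w_cap * loss_cap
```

The point of the sketch is only that all three losses are computed from the same encoder output; the actual reconstruction target, text tower, and captioning head would come from the paper's own training stack.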
Why it matters?
This work is important because it moves towards more versatile and efficient AI systems. Having a single encoder that excels at both understanding and generating images could lead to improvements in areas like image editing, creating realistic simulations, and building AI assistants that can interact with the visual world in a more sophisticated way. It also encourages further research into combining different AI tasks into unified models.
Abstract
This paper presents a family of advanced vision encoders, named OpenVision 3, that learn a single, unified visual representation that can serve both image understanding and image generation. Our core architecture is simple: we feed VAE-compressed image latents to a ViT encoder and train its output to support two complementary roles. First, the encoder output is passed to the ViT-VAE decoder to reconstruct the original image, encouraging the representation to capture generative structure. Second, the same representation is optimized with contrastive learning and image-captioning objectives, strengthening semantic features. By jointly optimizing reconstruction- and semantics-driven signals in a shared latent space, the encoder learns representations that synergize and generalize well across both regimes. We validate this unified design through extensive downstream evaluations with the encoder frozen. For multimodal understanding, we plug the encoder into the LLaVA-1.5 framework: it performs comparably to a standard CLIP vision encoder (e.g., 62.4 vs 62.2 on SeedBench, and 83.7 vs 82.9 on POPE). For generation, we test it under the RAE framework: ours substantially surpasses the standard CLIP-based encoder (e.g., gFID: 1.89 vs 2.54 on ImageNet). We hope this work can spur future research on unified modeling.
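The downstream evaluations keep the encoder frozen, so the understanding and generation pipelines only consume its features. Below is a minimal sketch of that protocol, reusing the hypothetical encoder from the earlier sketch; the actual LLaVA-1.5 and RAE integration code is not reproduced here.

```python
import torch


def extract_frozen_features(encoder, vae_latents):
    """Freeze the pretrained encoder and return its visual tokens for a downstream model."""
    encoder.eval()
    for p in encoder.parameters():
        p.requires_grad_(False)            # only the downstream head / LLM is trained
    with torch.no_grad():
        tokens, _ = encoder(vae_latents)   # (B, N, D) tokens, as in the sketch above
    return tokens
```

For understanding, these frozen tokens would then be fed into the LLaVA-1.5 pipeline; for generation, the RAE framework builds a generator on top of them. Both integrations follow those frameworks' own recipes rather than anything shown in this sketch.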