
OneVision-Encoder: Codec-Aligned Sparsity as a Foundational Principle for Multimodal Intelligence

Feilong Tang, Xiang An, Yunyao Yan, Yin Xie, Bin Qin, Kaicheng Yang, Yifei Shen, Yuanhan Zhang, Chunyuan Li, Shikun Feng, Changrui Chen, Huajie Tan, Ming Hu, Manyuan Zhang, Bo Li, Ziyong Feng, Ziwei Liu, Zongyuan Ge, Jiankang Deng

2026-02-16


Summary

This paper proposes a new way for artificial intelligence to process visual information such as images and videos. It argues that current methods are inefficient because they treat every part of an image or video equally, even though most of it is redundant background that carries little new information. The researchers developed a system called OneVision-Encoder that concentrates on the important, changing parts of a visual signal, much as video compression codecs do.

What's the problem?

Existing AI models for understanding images and videos are computationally expensive and inefficient. They process every pixel with equal effort, wasting resources on unchanging background. They fail to exploit the fact that visual data is highly redundant: most of it is predictable and carries little new information. The core issue is a mismatch between how these models are built and the fundamental structure of visual data, in which meaningful information is sparse and concentrated in change and motion.

What's the solution?

The researchers created OneVision-Encoder, a system that mimics how video codecs compress information. Instead of processing every pixel, it identifies and focuses on the small fraction of regions that actually *change* and therefore carry new information, a technique the paper calls 'Codec Patchification' (a rough intuition is sketched in the example below). Because the kept patches form an irregular layout, the system uses a shared 3D rotary position embedding (3D RoPE) to capture spatial (where things are) and temporal (how things move) information simultaneously. It is trained with a cluster-discrimination objective over more than a million semantic concepts, which teaches it what it is seeing and how things relate to each other over time.
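
To make the codec analogy concrete, here is a minimal sketch of patch-level sparsity: split two consecutive frames into patches, measure each patch's inter-frame residual energy, and keep only the patches where something actually changed. This is not the authors' code; the function names (e.g. `select_dynamic_patches`) and the 10% keep ratio are illustrative assumptions.

```python
# Minimal sketch (not the authors' code) of codec-style patch selection:
# spend the token budget on change, not on static background.
import numpy as np

def patchify(frame: np.ndarray, patch: int = 16) -> np.ndarray:
    """Split an (H, W, C) frame into a (num_patches, patch*patch*C) array."""
    H, W, C = frame.shape
    frame = frame[: H - H % patch, : W - W % patch]
    gh, gw = frame.shape[0] // patch, frame.shape[1] // patch
    tiles = frame.reshape(gh, patch, gw, patch, C).swapaxes(1, 2)
    return tiles.reshape(gh * gw, patch * patch * C)

def select_dynamic_patches(prev_frame, curr_frame, patch=16, keep_ratio=0.1):
    """Return indices of the patches with the largest inter-frame residual energy."""
    prev_p = patchify(prev_frame, patch).astype(np.float32)
    curr_p = patchify(curr_frame, patch).astype(np.float32)
    energy = ((curr_p - prev_p) ** 2).mean(axis=1)   # per-patch "surprise"
    k = max(1, int(keep_ratio * energy.size))        # illustrative: keep ~10% of patches
    return np.argsort(energy)[-k:]

# Toy usage: two 224x224 RGB frames that differ only in one small moving region.
prev = np.zeros((224, 224, 3), dtype=np.uint8)
curr = prev.copy()
curr[64:96, 64:96] = 255                             # the only "new information"
kept = select_dynamic_patches(prev, curr, keep_ratio=0.1)
print(f"kept {kept.size} of {patchify(curr).shape[0]} patches")
```

In the actual encoder the kept fraction varies with content (the paper reports roughly 3.1%-25% of regions), but the principle is the same: compute is concentrated where the signal carries entropy.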

Why it matters?

This work is important because it demonstrates that AI models can be both more accurate *and* more efficient. By focusing on the essential parts of visual data, OneVision-Encoder achieves better performance than existing models while using fewer resources and less training data. This approach paves the way for building more powerful and scalable AI systems that can truly understand the visual world, and it suggests that aligning AI architecture with information theory principles is a key to progress.

Abstract

Hypothesis. Artificial general intelligence is, at its core, a compression problem. Effective compression demands resonance: deep learning scales best when its architecture aligns with the fundamental structure of the data. Yet modern vision architectures have strayed from these principles: visual signals are highly redundant, while discriminative information, the surprise, is sparse. Current models process dense pixel grids uniformly, wasting vast compute on static background rather than focusing on the predictive residuals that define motion and meaning. We argue that to solve visual understanding, we must align our architectures with the information-theoretic principles of video, i.e., codecs.

Method. OneVision-Encoder encodes video by compressing predictive visual structure into semantic meaning. By adopting Codec Patchification, OV-Encoder abandons uniform computation to focus exclusively on the 3.1%-25% of regions rich in signal entropy. To unify spatial and temporal reasoning under irregular token layouts, OneVision-Encoder employs a shared 3D RoPE and is trained with a large-scale cluster discrimination objective over more than one million semantic concepts, jointly capturing object permanence and motion dynamics.

Evidence. The results validate our core hypothesis: efficiency and accuracy are not a trade-off; they are positively correlated. When integrated into an LLM, it consistently outperforms strong vision backbones such as Qwen3-ViT and SigLIP2 across 16 image, video, and document understanding benchmarks, despite using substantially fewer visual tokens and less pretraining data. Notably, on video understanding tasks, OV-Encoder achieves an average improvement of 4.1% over Qwen3-ViT. Codec-aligned, patch-level sparsity is a foundational principle, enabling OV-Encoder as a scalable engine for next-generation visual generalists.
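
The abstract's shared 3D RoPE is the piece that lets attention work over an irregular, codec-style token layout. Below is a minimal sketch of the general technique (an assumption about how a 3D rotary embedding can be built, not the paper's implementation; the helper names are illustrative): each retained patch carries explicit (t, y, x) coordinates, and the channel dimension is split into three groups, each rotated by one coordinate.

```python
# Minimal sketch (an assumption, not the paper's code) of a 3D rotary position
# embedding applied to an irregular set of kept tokens with (t, y, x) coordinates.
import numpy as np

def rope_1d(x: np.ndarray, pos: np.ndarray, base: float = 10000.0) -> np.ndarray:
    """Rotate feature pairs of x (tokens, dim) by angles pos * theta_i (standard RoPE)."""
    dim = x.shape[-1]                                   # assumed even
    theta = base ** (-np.arange(0, dim, 2) / dim)       # (dim/2,)
    ang = pos[:, None] * theta[None, :]                 # (tokens, dim/2)
    cos, sin = np.cos(ang), np.sin(ang)
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = np.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

def rope_3d(x: np.ndarray, coords: np.ndarray) -> np.ndarray:
    """Split channels into three groups and rotate each by t, y, x respectively.
    Assumes the channel dim is divisible by 3 with even-sized groups."""
    d = x.shape[-1] // 3
    parts = [rope_1d(x[:, i * d:(i + 1) * d], coords[:, i]) for i in range(3)]
    return np.concatenate(parts, axis=-1)

# Irregular token set: only the patches a codec-style selector kept, each with (t, y, x).
tokens = np.random.randn(19, 96).astype(np.float32)     # 19 sparse tokens, 96 channels
coords = np.random.randint(0, 14, size=(19, 3)).astype(np.float32)
encoded = rope_3d(tokens, coords)
print(encoded.shape)                                     # (19, 96)
```

Because each token's rotation depends only on its own coordinates, the encoding behaves the same whether the tokens form a dense grid or the sparse, irregular set left after Codec Patchification.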