Context Unrolling in Omni Models

Ceyuan Yang, Zhijie Lin, Yang Zhao, Fei Xiao, Hao He, Qi Zhao, Chaorui Deng, Kunchang Li, Zihan Ding, Yuwei Guo, Fuyun Wang, Fangqi Zhu, Xiaonan Nie, Shenhan Zhu, Shanchuan Lin, Hongsheng Li, Weilin Huang, Guang Shi, Haoqi Fan

2026-04-24

Summary

This paper introduces Omni, a new artificial intelligence model that can both understand and generate many different types of data, including text, pictures, videos, and even 3D shapes, all within a single model.

What's the problem?

Existing AI models often specialize in just one type of data, like only understanding text or only processing images. This limits their ability to truly understand the world, where different kinds of information are naturally connected. It's hard for these models to make good decisions or create realistic content when they can't see the bigger picture across different data types.

What's the solution?

The researchers created Omni by training it on a huge amount of diverse data, including text, images, videos, and 3D models, all together. This joint training allowed the model to develop a process called 'Context Unrolling,' in which it actively draws on information from every available modality before making a prediction or generating something new. Essentially, the model thinks through a problem using all the clues it has, instead of focusing on just one kind of information.
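The paper does not include code, but the core idea can be illustrated with a toy sketch: project every modality into one shared token sequence, let a transformer append intermediate "reasoning" tokens that attend over all modalities, and only then read out a prediction. Everything below (the ToyOmni class, its dimensions, and the unrolling loop) is a hypothetical illustration of that idea, not the authors' implementation.

```python
# Minimal sketch of "context unrolling" over a shared multimodal token
# sequence. All names, sizes, and the unrolling scheme are assumptions.
import torch
import torch.nn as nn

D = 256  # shared embedding width (assumed)

class ToyOmni(nn.Module):
    def __init__(self, vocab_size=1000, n_layers=2, n_heads=4):
        super().__init__()
        # Per-modality adapters map heterogeneous inputs into one token space.
        self.text_emb = nn.Embedding(vocab_size, D)
        self.image_proj = nn.Linear(768, D)   # e.g. image patch features
        self.video_proj = nn.Linear(768, D)   # e.g. video frame features
        self.geom_proj = nn.Linear(3, D)      # e.g. 3D points as tokens
        layer = nn.TransformerEncoderLayer(D, n_heads, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(D, vocab_size)  # final prediction head

    def forward(self, text_ids, image_feats, video_feats, points,
                unroll_steps=4):
        # 1) Embed every modality into one shared token sequence.
        tokens = torch.cat([
            self.text_emb(text_ids),
            self.image_proj(image_feats),
            self.video_proj(video_feats),
            self.geom_proj(points),
        ], dim=1)
        # 2) "Unroll": append intermediate reasoning tokens one at a time,
        #    so each step attends over all modalities seen so far.
        for _ in range(unroll_steps):
            h = self.backbone(tokens)
            next_tok = h[:, -1:, :]  # last hidden state becomes a new token
            tokens = torch.cat([tokens, next_tok], dim=1)
        # 3) Only after unrolling, read out the final prediction.
        return self.head(self.backbone(tokens)[:, -1, :])

model = ToyOmni()
logits = model(
    text_ids=torch.randint(0, 1000, (1, 8)),
    image_feats=torch.randn(1, 16, 768),
    video_feats=torch.randn(1, 16, 768),
    points=torch.randn(1, 32, 3),
)
print(logits.shape)  # torch.Size([1, 1000])
```

In this sketch the unrolled tokens share the embedding space with the input modalities, which is what lets each step aggregate complementary information across all of them; the real model presumably generates actual multimodal tokens autoregressively rather than reusing hidden states as done here.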

Why does it matter?

Omni is important because it represents a step toward more versatile and intelligent AI. By handling multiple types of data in a single model, it can perform tasks that single-modality systems cannot, like generating a video from a text description or creating a 3D model based on an image. This has potential applications in many fields, from creative content generation to scientific research and beyond, allowing AI to better mimic human understanding.

Abstract

We present Omni, a unified multimodal model natively trained on diverse modalities, including text, images, videos, 3D geometry, and hidden representations. We find that such training enables Context Unrolling, where the model explicitly reasons across multiple modal representations before producing predictions. This process enables the model to aggregate complementary information across heterogeneous modalities, facilitating a more faithful approximation of the shared multimodal knowledge manifold and improving downstream reasoning fidelity. As a result, Omni achieves strong performance on both multimodal generation and understanding benchmarks, while demonstrating advanced multimodal reasoning capabilities, including in-context generation of text, image, video, and 3D geometry.