OmniVinci: Enhancing Architecture and Data for Omni-Modal Understanding LLM

Hanrong Ye, Chao-Han Huck Yang, Arushi Goel, Wei Huang, Ligeng Zhu, Yuanhang Su, Sean Lin, An-Chieh Cheng, Zhen Wan, Jinchuan Tian, Yuming Lou, Dong Yang, Zhijian Liu, Yukang Chen, Ambrish Dantrey, Ehsan Jahangiri, Sreyan Ghosh, Daguang Xu, Ehsan Hosseini-Asl, Danial Mohseni Taheri, Vidya Murali, Sifei Liu

2025-10-20

Summary

This paper introduces OmniVinci, a new, powerful, and openly available artificial intelligence model that can understand and process information from multiple sources like images, audio, and text – similar to how humans perceive the world.

What's the problem?

Current AI models often focus on just one type of data, like text or images. To create truly intelligent systems, we need models that can combine information from different sources. Building these 'omni-modal' models is challenging, both because it is hard to get the different types of data to work together effectively and because training them demands very large amounts of data.

What's the solution?

The researchers tackled this problem by designing a new model architecture with three key improvements. First, they created OmniAlignNet, which aligns the model's understanding of images and sounds in a shared embedding space. Second, Temporal Embedding Grouping helps the model understand the relative *order* of events across video and audio. Third, Constrained Rotary Time Embedding improves how the model tracks absolute time within the data. They also built a pipeline that generated 24 million conversations involving single and multiple types of data to train the model. The resulting model, OmniVinci, was trained with significantly less data than existing models (0.2 trillion tokens versus Qwen2.5-Omni's 1.2 trillion) but still performed better on various tests.
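To make the first idea concrete, here is a minimal numpy sketch of aligning vision and audio embeddings in a shared latent space, in the spirit of OmniAlignNet. All names, dimensions, and the contrastive setup are illustrative assumptions; the paper's actual architecture is not detailed in this summary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical dimensions: per-modality encoder outputs and a shared space.
d_vision, d_audio, d_shared = 768, 512, 256
batch = 4

# Stand-ins for features from upstream vision and audio encoders.
vision_feats = rng.standard_normal((batch, d_vision))
audio_feats = rng.standard_normal((batch, d_audio))

# Learnable linear projections into the shared omni-modal space
# (randomly initialized here; in training these would be optimized).
W_v = rng.standard_normal((d_vision, d_shared)) * 0.02
W_a = rng.standard_normal((d_audio, d_shared)) * 0.02

def project_and_normalize(x, W):
    """Project features into the shared space and L2-normalize each row."""
    z = x @ W
    return z / np.linalg.norm(z, axis=-1, keepdims=True)

z_v = project_and_normalize(vision_feats, W_v)
z_a = project_and_normalize(audio_feats, W_a)

# Cosine-similarity matrix between paired vision/audio clips. A contrastive
# alignment objective would pull the diagonal (true pairs) up and push the
# off-diagonal (mismatched pairs) down.
sim = z_v @ z_a.T
print(sim.shape)  # (4, 4)
```

The key design point this sketch illustrates is that both modalities end up as unit vectors in one space, so a single similarity measure can compare them directly.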

Why it matters?

This work is important because it pushes the field of AI closer to creating systems that can understand the world as we do. OmniVinci’s strong performance with less training data makes it more accessible for researchers and developers. The model also shows promise for real-world applications in areas like robotics, healthcare, and manufacturing, where understanding multiple types of information is crucial.

Abstract

Advancing machine intelligence requires developing the ability to perceive across multiple modalities, much as humans sense the world. We introduce OmniVinci, an initiative to build a strong, open-source, omni-modal LLM. We carefully study the design choices across model architecture and data curation. For model architecture, we present three key innovations: (i) OmniAlignNet for strengthening alignment between vision and audio embeddings in a shared omni-modal latent space; (ii) Temporal Embedding Grouping for capturing relative temporal alignment between vision and audio signals; and (iii) Constrained Rotary Time Embedding for encoding absolute temporal information in omni-modal embeddings. We introduce a curation and synthesis pipeline that generates 24M single-modal and omni-modal conversations. We find that modalities reinforce one another in both perception and reasoning. Our model, OmniVinci, outperforms Qwen2.5-Omni with +19.05 on DailyOmni (cross-modal understanding), +1.7 on MMAR (audio), and +3.9 on Video-MME (vision), while using just 0.2T training tokens - a 6 times reduction compared to Qwen2.5-Omni's 1.2T. We finally demonstrate omni-modal advantages in downstream applications spanning robotics, medical AI, and smart factory.
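The third innovation, Constrained Rotary Time Embedding, encodes absolute time into embeddings. A minimal numpy sketch of the general rotary-time idea follows; the frequency schedule, the form of the constraint, and the function name are all assumptions for illustration, not the paper's actual formulation.

```python
import numpy as np

def rotary_time_embed(x, t, max_time=30.0):
    """Rotate each consecutive (even, odd) dimension pair of x by an angle
    proportional to the absolute timestamp t.

    x: (d,) embedding with even d; t: timestamp in seconds.
    Frequencies are bounded so the slowest rotation completes one full
    cycle over max_time (this "constraint" is an illustrative assumption).
    """
    d = x.shape[0]
    half = d // 2
    # Log-spaced frequencies, scaled into a bounded range.
    freqs = (2 * np.pi / max_time) * (10000.0 ** (-np.arange(half) / half))
    angles = t * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x_even * cos - x_odd * sin
    out[1::2] = x_even * sin + x_odd * cos
    return out

x = np.ones(8)
y0 = rotary_time_embed(x, t=0.0)   # at t=0 the rotation is the identity
y5 = rotary_time_embed(x, t=5.0)
# Rotations preserve the norm, so timing is encoded without changing
# the embedding's magnitude.
print(np.allclose(np.linalg.norm(y0), np.linalg.norm(y5)))  # True
```

This also hints at why rotary encodings suit temporal alignment: the relative angle between two embeddings depends only on the time difference between them.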