VITA: Towards Open-Source Interactive Omni Multimodal LLM
Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Xiong Wang, Di Yin, Long Ma, Xiawu Zheng, Ran He, Rongrong Ji, Yunsheng Wu, Caifeng Shan, Xing Sun
2024-08-12

Summary
This paper introduces VITA, the first open-source Multimodal Large Language Model (MLLM) that can process and analyze video, images, text, and audio simultaneously while providing an advanced interactive experience.
What's the problem?
While advanced models like GPT-4o demonstrate impressive capabilities in handling multiple types of data, most open-source models struggle to match both the breadth of modalities they cover and the quality of their interactive experience. This limits their practical applications and makes it harder for developers to build versatile AI systems that can understand and interact with different forms of media effectively.
What's the solution?
VITA is built on the Mixtral 8x7B language model, whose Chinese vocabulary is expanded before the model is instruction-tuned bilingually. A two-stage multi-task learning process, multimodal alignment followed by multimodal instruction tuning, then gives it the ability to understand visual and audio information alongside text. One of VITA's key innovations is that it can interact with users without needing a wake-up word, allowing for more natural conversations. It can also handle interruptions from users while it is generating a response, making the interaction feel more fluid and responsive (see the sketch after this paragraph).
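The wake-word-free, interruptible interaction can be pictured as an always-listening loop that filters out background audio and cancels any in-flight reply when a new query arrives. The sketch below is a minimal illustration of that control flow only; `ResponseWorker`, `looks_like_query`, and the placeholder streamed tokens are hypothetical names and do not reflect VITA's actual implementation.

```python
import queue
import threading

class ResponseWorker(threading.Thread):
    """Streams a spoken reply in the background; cancellable when the user interrupts."""

    def __init__(self, query_audio):
        super().__init__(daemon=True)
        self.query_audio = query_audio
        self.cancel = threading.Event()

    def run(self):
        for token in self.generate_tokens(self.query_audio):
            if self.cancel.is_set():          # audio interrupt: stop mid-response
                return
            print(token, end=" ", flush=True)

    def generate_tokens(self, audio):
        # Placeholder for the MLLM's streaming decoder (hypothetical).
        yield from ("hypothetical", "streamed", "reply", "tokens")


def looks_like_query(audio_chunk) -> bool:
    # Placeholder: in practice a model decides whether the chunk is a user query
    # addressed to the system or just background noise to be ignored.
    return True


def interaction_loop(mic_chunks: "queue.Queue[bytes]"):
    """Always-on loop: no wake-up word, and a new query interrupts the current reply."""
    current = None
    while True:
        chunk = mic_chunks.get()              # blocks until the microphone yields audio
        if not looks_like_query(chunk):       # background noise is simply discarded
            continue
        if current is not None and current.is_alive():
            current.cancel.set()              # interrupt the reply that is being spoken
            current.join()
        current = ResponseWorker(chunk)
        current.start()
```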
Why it matters?
This research is significant because it represents a major step forward in creating open-source AI models that can handle various types of input seamlessly. By improving how machines understand and interact with different media, VITA can help advance fields like education, entertainment, and customer service, making technology more accessible and user-friendly.
Abstract
The remarkable multimodal capabilities and interactive experience of GPT-4o underscore their necessity in practical applications, yet open-source models rarely excel in both areas. In this paper, we introduce VITA, the first-ever open-source Multimodal Large Language Model (MLLM) adept at simultaneous processing and analysis of Video, Image, Text, and Audio modalities, while also offering an advanced multimodal interactive experience. Starting from Mixtral 8x7B as a language foundation, we expand its Chinese vocabulary and perform bilingual instruction tuning. We further endow the language model with visual and audio capabilities through two-stage multi-task learning of multimodal alignment and instruction tuning. VITA demonstrates robust foundational capabilities in multilingual, vision, and audio understanding, as evidenced by its strong performance across a range of both unimodal and multimodal benchmarks. Beyond foundational capabilities, we have made considerable progress in enhancing the natural multimodal human-computer interaction experience. To the best of our knowledge, we are the first to exploit non-awakening interaction and audio interrupt in MLLMs. VITA is the first step for the open-source community to explore the seamless integration of multimodal understanding and interaction. While much work remains for VITA to approach its closed-source counterparts, we hope that its role as a pioneer can serve as a cornerstone for subsequent research. Project Page: https://vita-home.github.io.
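As a rough picture of the two-stage recipe mentioned in the abstract, the sketch below separates a multimodal alignment stage (training only the vision and audio projectors while the language model is frozen) from a multimodal instruction-tuning stage (training the full model on mixed instruction data). The module names, dataset names, and freezing scheme are illustrative assumptions, not VITA's released training code; the model is assumed to be a PyTorch `nn.Module` whose top-level children match the listed names.

```python
from dataclasses import dataclass

@dataclass
class StageConfig:
    name: str
    trainable_modules: list[str]   # top-level submodules that receive gradients
    data: list[str]                # dataset mixtures used in this stage

TRAINING_STAGES = [
    # Stage 1: multimodal alignment — train only the projectors that map vision/audio
    # features into the language model's embedding space; the LLM stays frozen.
    StageConfig(
        name="multimodal_alignment",
        trainable_modules=["vision_projector", "audio_projector"],
        data=["image_caption_pairs", "audio_transcription_pairs"],
    ),
    # Stage 2: multimodal instruction tuning — unfreeze the language model and train on
    # mixed video/image/text/audio instruction data so it follows multimodal prompts.
    StageConfig(
        name="multimodal_instruction_tuning",
        trainable_modules=["vision_projector", "audio_projector", "language_model"],
        data=["visual_instructions", "audio_instructions", "bilingual_text_instructions"],
    ),
]

def configure_stage(model, cfg: StageConfig):
    """Enable gradients only for the submodules named in the stage config.

    Assumes `model` is a PyTorch nn.Module whose top-level children are named like
    the entries in `trainable_modules` (an illustrative assumption).
    """
    for name, module in model.named_children():
        module.requires_grad_(name in cfg.trainable_modules)
```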