OMCAT: Omni Context Aware Transformer
Arushi Goel, Karan Sapra, Matthieu Le, Rafael Valle, Andrew Tao, Bryan Catanzaro
2024-10-17

Summary
This paper introduces OMCAT, a model designed to improve how well language models understand and connect information from audio and video, particularly how events in the two streams relate over time.
What's the problem?
While large language models (LLMs) have become good at understanding text, they struggle to follow events that unfold jointly in audio and video, especially when those events are related in time. This makes it hard for them to answer questions that require linking information across both types of media.
What's the solution?
To solve this problem, the authors created a new dataset called OCTAV (Omni Context and Temporal Audio Video), which captures how events transition between audio and video. They also developed OMCAT (Omni Context Aware Transformer), a model that uses Rotary Time Embeddings (RoTE), an extension of RoPE, to anchor tokens to the times at which events occur (a rough sketch of the idea follows below). OMCAT is trained in three stages, feature alignment, instruction tuning, and OCTAV-specific training, so that it learns to align and interpret information from the different modalities. The model achieves state-of-the-art results on tasks that require understanding both audio and visual information, including audio-visual question answering and the OCTAV benchmark.
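The summary does not spell out how RoTE is computed, but a minimal sketch of the general idea, extending RoPE so that rotation angles are driven by each token's timestamp in seconds rather than its integer position, might look like the Python snippet below. All names, parameters, and the sampling times in the example are assumptions made for illustration, not the authors' implementation; the point of keying rotations to a shared time axis is that audio and video tokens occurring at the same moment receive consistent positional treatment.

# Hypothetical sketch of RoPE-style rotary embeddings keyed to timestamps
# rather than token indices, to illustrate the idea behind Rotary Time
# Embeddings (RoTE). Not the authors' implementation; details may differ.
import numpy as np

def rotary_time_embedding(x: np.ndarray, timestamps: np.ndarray,
                          base: float = 10000.0) -> np.ndarray:
    """Rotate feature pairs of x by angles proportional to each token's time.

    x          : (seq_len, dim) query/key features, dim must be even
    timestamps : (seq_len,) time anchor of each audio/video token, in seconds
    """
    seq_len, dim = x.shape
    assert dim % 2 == 0, "feature dimension must be even"

    # Standard RoPE frequency schedule, but the rotation angle is the real
    # timestamp (in seconds) instead of the integer token position.
    inv_freq = 1.0 / (base ** (np.arange(0, dim, 2) / dim))  # (dim/2,)
    angles = np.outer(timestamps, inv_freq)                  # (seq_len, dim/2)

    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:, 0::2], x[:, 1::2]                          # consecutive pairs
    rotated = np.empty_like(x)
    rotated[:, 0::2] = x1 * cos - x2 * sin                   # 2-D rotation of
    rotated[:, 1::2] = x1 * sin + x2 * cos                   # each feature pair
    return rotated

# Example: six tokens (e.g. alternating video frames and audio chunks) that
# share one time axis; tokens with equal timestamps get identical rotations.
feats = np.random.randn(6, 8)
times = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 2.5])
print(rotary_time_embedding(feats, times).shape)  # (6, 8)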
Why it matters?
This research is important because it enhances the ability of AI systems to process and understand complex information from multiple sources. By improving how these models handle audio and video together, OMCAT can lead to better applications in fields like education, entertainment, and healthcare, where understanding context across different media is crucial.
Abstract
Large Language Models (LLMs) have made significant strides in text generation and comprehension, with recent advancements extending into multimodal LLMs that integrate visual and audio inputs. However, these models continue to struggle with fine-grained, cross-modal temporal understanding, particularly when correlating events across audio and video streams. We address these challenges with two key contributions: a new dataset and model, called OCTAV and OMCAT respectively. First, OCTAV (Omni Context and Temporal Audio Video) is a novel dataset designed to capture event transitions across audio and video. Second, OMCAT (Omni Context Aware Transformer) is a powerful model that leverages RoTE (Rotary Time Embeddings), an innovative extension of RoPE, to enhance temporal grounding and computational efficiency in time-anchored tasks. Through a robust three-stage training pipeline (feature alignment, instruction tuning, and OCTAV-specific training), OMCAT excels in cross-modal temporal understanding. Our model demonstrates state-of-the-art performance on Audio-Visual Question Answering (AVQA) tasks and the OCTAV benchmark, showcasing significant gains in temporal reasoning and cross-modal alignment, as validated through comprehensive experiments and ablation studies. Our dataset and code will be made publicly available. The link to our demo page is https://om-cat.github.io.