video-SALMONN-o1: Reasoning-enhanced Audio-visual Large Language Model
Guangzhi Sun, Yudong Yang, Jimin Zhuang, Changli Tang, Yixuan Li, Wei Li, Zejun MA, Chao Zhang
2025-02-18
Summary
This paper introduces video-SALMONN-o1, a new AI system that can understand and reason about videos by combining audio, visual, and language information. It's like teaching a computer to watch and understand videos the way humans do, but with extra reasoning skills.
What's the problem?
Current AI models are good at solving math problems or understanding pictures, but they struggle with understanding videos in a more complete way. They can't easily combine what they see and hear in a video to answer complex questions or explain what's happening.
What's the solution?
The researchers created video-SALMONN-o1, which can process both the visual and audio parts of a video. They built a special dataset of challenging questions about videos, each paired with a step-by-step answer, to train the AI. They also developed a new training method called process direct preference optimization (pDPO), which rewards the AI for each good reasoning step rather than only the final answer, helping it learn to reason better. To test how well it works, they made RivaBench, a collection of over 4,000 expert-made questions about different types of videos.
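The summary does not give the pDPO formula, but the underlying idea of direct preference optimization can be illustrated with a short sketch. This is an assumption-laden illustration, not the paper's actual method: it applies a standard DPO-style objective to a single pair of preferred/dispreferred reasoning steps, with a hypothetical function name and made-up log-probability values.

```python
import math

def dpo_step_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Illustrative DPO-style loss for one step-level preference pair.

    logp_w / logp_l: policy log-probabilities of the preferred (w)
    and dispreferred (l) reasoning step; ref_logp_*: the same under
    a frozen reference model. The loss decreases as the policy
    favours the preferred step more strongly than the reference does.
    (Sketch only; pDPO's actual step selection and reward modelling
    are described in the paper, not reproduced here.)
    """
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    return -math.log(1.0 / (1.0 + math.exp(-margin)))  # -log(sigmoid(margin))

# Policy already leans toward the better step relative to the reference:
low = dpo_step_loss(-1.0, -3.0, -2.0, -2.5, beta=1.0)   # small loss
# Policy leans toward the worse step:
high = dpo_step_loss(-3.0, -1.0, -2.5, -2.0, beta=1.0)  # large loss
assert low < high
```

In step-level training, a loss like this would be summed over many such step pairs, which is what lets the model get feedback on intermediate reasoning rather than only on final answers.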
Why it matters?
This matters because it could lead to AI that can understand videos much better, which could be useful in many areas. For example, it could help create better closed captions for videos, assist in video editing, or even help detect fake videos. The fact that it can understand both what it sees and hears makes it more like how humans process information, which could lead to more advanced and helpful AI systems in the future.
Abstract
While recent advancements in reasoning optimization have significantly enhanced the capabilities of large language models (LLMs), existing efforts to improve reasoning have been limited to solving mathematical problems and focusing on visual graphical inputs, neglecting broader applications in general video understanding. This paper proposes video-SALMONN-o1, the first open-source reasoning-enhanced audio-visual LLM designed for general video understanding tasks. To enhance its reasoning abilities, we develop a reasoning-intensive dataset featuring challenging audio-visual questions with step-by-step solutions. We also propose process direct preference optimization (pDPO), which leverages contrastive step selection to achieve efficient step-level reward modelling tailored for multimodal inputs. Additionally, we introduce RivaBench, the first reasoning-intensive video understanding benchmark, featuring over 4,000 high-quality, expert-curated question-answer pairs across scenarios such as standup comedy, academic presentations, and synthetic video detection. video-SALMONN-o1 achieves 3-8% accuracy improvements over the LLaVA-OneVision baseline across different video reasoning benchmarks. In addition, pDPO achieves 6-8% improvements compared to the supervised fine-tuning model on RivaBench. Enhanced reasoning also enables video-SALMONN-o1 to perform zero-shot synthetic video detection.