video-SALMONN: Speech-Enhanced Audio-Visual Large Language Models

Guangzhi Sun, Wenyi Yu, Changli Tang, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, Yuxuan Wang, Chao Zhang

2024-06-25

Summary

This paper introduces video-SALMONN, an audio-visual large language model (LLM) designed to understand videos by processing not just the visuals but also the sounds and speech within them. It aims to enhance how machines comprehend video content.

What's the problem?

Understanding videos is complex because they combine several kinds of information: moving images, audio events such as music and sound effects, and speech. Existing models often struggle to handle all of these at once, and speech is especially difficult because it requires fine-grained temporal detail to interpret correctly in the context of the video. This limits how well such models can perform tasks that depend on video content.

What's the solution?

The authors developed video-SALMONN, which uses a structure called the multi-resolution causal Q-Former (MRC Q-Former) to connect pre-trained audio and visual encoders to a language model. This structure lets the model capture the fine temporal details needed for speech while still processing other video elements efficiently. They also introduced dedicated training techniques, a diversity loss and an unpaired audio-visual mixed training scheme, to keep any single frame or modality from dominating the learning process. As a result, video-SALMONN shows significant improvements on tasks like answering questions about videos and understanding audio-visual content; a simplified sketch of the core idea follows.
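
To make the idea concrete, here is a minimal, heavily simplified sketch in PyTorch of how learnable query tokens could cross-attend to audio-visual frame features at several temporal resolutions. The window sizes, dimensions, and single-attention-layer design are illustrative assumptions, not the authors' actual MRC Q-Former implementation.

```python
# Minimal sketch (not the paper's code): query tokens summarize audio-visual
# frame features at several temporal window sizes, and the per-window outputs
# are concatenated into the token sequence passed to the language model.
import torch
import torch.nn as nn

class MultiResolutionQFormerSketch(nn.Module):
    def __init__(self, feat_dim=768, num_queries=32, window_sizes=(1, 4, 16)):
        super().__init__()
        self.window_sizes = window_sizes
        # One set of learnable query tokens and one cross-attention block per resolution.
        self.queries = nn.ParameterList(
            [nn.Parameter(torch.randn(num_queries, feat_dim)) for _ in window_sizes]
        )
        self.cross_attn = nn.ModuleList(
            [nn.MultiheadAttention(feat_dim, num_heads=8, batch_first=True)
             for _ in window_sizes]
        )

    def forward(self, frame_feats):
        # frame_feats: (batch, num_frames, feat_dim) fused audio-visual frame features.
        batch, num_frames, _ = frame_feats.shape
        outputs = []
        for q, attn, w in zip(self.queries, self.cross_attn, self.window_sizes):
            # Split the frame sequence into non-overlapping windows of w frames;
            # small windows keep fine temporal detail (useful for speech),
            # large windows summarize longer spans more cheaply.
            for start in range(0, num_frames, w):
                window = frame_feats[:, start:start + w]
                queries = q.unsqueeze(0).expand(batch, -1, -1)
                out, _ = attn(queries, window, window)  # queries attend to the window
                outputs.append(out)
        # Concatenate all per-window query outputs into the tokens fed to the LLM.
        return torch.cat(outputs, dim=1)

# Example usage with dummy features: 2 videos, 16 frames, 768-dim features.
# model = MultiResolutionQFormerSketch()
# tokens = model(torch.randn(2, 16, 768))
```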

Why it matters?

This research is important because it represents a step forward in making AI systems more capable of understanding complex video information. By improving how models interpret both speech and visual data together, video-SALMONN can enhance applications such as video analysis, educational tools, and interactive media. This could lead to more intuitive and effective technology for users in various fields.

Abstract

Speech understanding as an element of the more generic video understanding using audio-visual large language models (av-LLMs) is a crucial yet understudied aspect. This paper proposes video-SALMONN, a single end-to-end av-LLM for video processing, which can understand not only visual frame sequences, audio events and music, but speech as well. To obtain the fine-grained temporal information required by speech understanding, while remaining efficient for other video elements, this paper proposes a novel multi-resolution causal Q-Former (MRC Q-Former) structure to connect pre-trained audio-visual encoders and the backbone large language model. Moreover, dedicated training approaches including the diversity loss and the unpaired audio-visual mixed training scheme are proposed to avoid frame or modality dominance. On the introduced speech-audio-visual evaluation benchmark, video-SALMONN achieves more than 25% absolute accuracy improvements on the video-QA task and over 30% absolute accuracy improvements on audio-visual QA tasks with human speech. In addition, video-SALMONN demonstrates remarkable video comprehension and reasoning abilities on tasks that are unprecedented by other av-LLMs. The training code and model checkpoints are available at https://github.com/bytedance/SALMONN/.
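
As an illustration of the diversity-loss idea mentioned in the abstract, the sketch below penalizes pairwise cosine similarity among the query tokens produced by the Q-Former so that they do not collapse onto the same frames or modality. This is an assumed, simplified formulation for illustration only; the paper's actual diversity loss may be defined differently.

```python
# Illustrative diversity-style regularizer (assumed formulation, not the paper's exact loss).
import torch
import torch.nn.functional as F

def diversity_penalty(query_outputs: torch.Tensor) -> torch.Tensor:
    """Penalize pairwise cosine similarity among Q-Former output tokens.

    query_outputs: (batch, num_tokens, dim) tokens produced for the LLM.
    Returns a scalar; lower values mean the tokens are more diverse.
    """
    normed = F.normalize(query_outputs, dim=-1)
    sim = normed @ normed.transpose(1, 2)        # (batch, N, N) cosine similarities
    eye = torch.eye(sim.size(1), device=sim.device)
    off_diag = sim * (1.0 - eye)                 # zero out each token's self-similarity
    return off_diag.abs().mean()

# Example: combine with the main training objective using a small weight.
# loss = task_loss + 0.1 * diversity_penalty(query_outputs)
```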