AVoCaDO: An Audiovisual Video Captioner Driven by Temporal Orchestration

Xinlong Chen, Yue Ding, Weihong Lin, Jingyun Hua, Linli Yao, Yang Shi, Bozhou Li, Yuanxing Zhang, Qiang Liu, Pengfei Wan, Liang Wang, Tieniu Tan

2025-10-14

Summary

This paper introduces AVoCaDO, a new system that automatically writes detailed descriptions of videos, taking into account both what happens on screen and what can be heard in the audio.

What's the problem?

Existing video captioning systems often struggle to create descriptions that accurately reflect *when* specific events happen in the video, and they don't always effectively combine information from both the video and the audio. They might describe things generally, but miss important timing details or ignore crucial audio cues.

What's the solution?

The researchers built AVoCaDO in two stages. First, in a supervised fine-tuning stage (AVoCaDO SFT), they trained the model on a newly curated collection of 107,000 high-quality video captions that were checked for accuracy and for temporal alignment between the audio and the visuals. Then, in a reinforcement-learning stage (AVoCaDO GRPO), they used tailored reward functions to further improve the captions, encouraging them to follow the video's timeline, quote dialogue accurately, avoid becoming overlong or repetitive, and stay focused on the video content. A rough sketch of such a composite reward appears below.
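
To make the reward idea concrete, here is a minimal Python sketch of a composite caption reward. The sub-reward heuristics, weights, and function names are illustrative assumptions, not AVoCaDO's actual reward design:

```python
# Hypothetical composite reward for caption post-training.
# All heuristics and weights below are illustrative, not from the paper.

def repetition_penalty(caption: str, n: int = 4) -> float:
    """Fraction of repeated n-grams; higher means more self-repetition."""
    words = caption.lower().split()
    ngrams = [tuple(words[i:i + n]) for i in range(len(words) - n + 1)]
    if not ngrams:
        return 0.0
    return 1.0 - len(set(ngrams)) / len(ngrams)

def length_penalty(caption: str, target_len: int = 120, slack: int = 40) -> float:
    """Linear penalty once the word count drifts out of a target band."""
    overshoot = max(0, abs(len(caption.split()) - target_len) - slack)
    return min(1.0, overshoot / target_len)

def temporal_order_score(caption: str, ordered_events: list[str]) -> float:
    """Reward mentioning annotated events in their chronological order."""
    positions = [caption.lower().find(e.lower()) for e in ordered_events]
    found = [p for p in positions if p >= 0]
    if len(found) < 2:
        return float(len(found) == len(ordered_events))
    in_order = sum(a < b for a, b in zip(found, found[1:]))
    return in_order / (len(found) - 1)

def composite_reward(caption: str, events: list[str]) -> float:
    """Weighted mix of sub-rewards; the weights here are arbitrary."""
    return (1.0 * temporal_order_score(caption, events)
            - 0.5 * length_penalty(caption)
            - 0.5 * repetition_penalty(caption))
```

In a real pipeline the dialogue-accuracy term would compare the caption's quoted speech against a transcript; it is omitted here to keep the sketch self-contained.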

Why it matters?

This work is important because better video captions can help people understand videos more easily, especially those who are deaf or hard of hearing. It also improves how computers can 'understand' videos, which is crucial for tasks like video search, automatic editing, and even creating new videos.

Abstract

Audiovisual video captioning aims to generate semantically rich descriptions with temporal alignment between visual and auditory events, thereby benefiting both video understanding and generation. In this paper, we present AVoCaDO, a powerful audiovisual video captioner driven by the temporal orchestration between audio and visual modalities. We propose a two-stage post-training pipeline: (1) AVoCaDO SFT, which fine-tunes the model on a newly curated dataset of 107K high-quality, temporally aligned audiovisual captions; and (2) AVoCaDO GRPO, which leverages tailored reward functions to further enhance temporal coherence and dialogue accuracy while regularizing caption length and reducing collapse. Experimental results demonstrate that AVoCaDO significantly outperforms existing open-source models across four audiovisual video captioning benchmarks, and also achieves competitive performance on the VDC and DREAM-1K benchmarks under visual-only settings.
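
As background on the second stage: GRPO (Group Relative Policy Optimization) dispenses with a learned value model and instead scores a group of sampled captions for the same video against one another. Here is a minimal sketch of that group-relative advantage computation (generic GRPO, not AVoCaDO-specific code):

```python
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """GRPO-style advantages: normalize each sampled caption's reward
    by the mean and standard deviation of its group (all captions
    sampled for the same video)."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against zero spread
    return [(r - mean) / std for r in rewards]

# e.g. rewards for four captions sampled from one video
print(group_relative_advantages([0.8, 0.5, 0.9, 0.4]))
```

Captions scoring above their group's mean receive positive advantages and are reinforced; those below the mean are suppressed.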