UGC-VideoCaptioner: An Omni UGC Video Detail Caption Model and New Benchmarks
Peiran Wu, Yunze Liu, Zhengdong Zhu, Enmin Zhou, Shawn Shen
2025-07-16
Summary
This paper introduces UGC-VideoCaptioner, a new model and benchmark designed to produce detailed captions for short user-generated videos by understanding both their audio and visual content.
What's the problem?
Many existing models struggle to caption short user-generated videos well: they tend to favor either the visual or the audio track instead of balancing the two, and the captions they produce often lack detail.
What's the solution?
UGC-VideoCaptioner addresses this with a two-stage training strategy: the model first learns from each modality separately, then combines them with balanced attention so that its captions accurately describe both what is seen and what is heard. The authors also introduce a new benchmark, UGC-VideoCap, to evaluate how well models perform on this task (a toy sketch of the two-stage idea follows below).
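To make the two-stage idea concrete, here is a minimal, hypothetical Python sketch of a "separate-then-joint" training recipe, using toy encoders and a learned audio-visual balance weight. The module names, toy data, and training details are illustrative assumptions made for this summary, not the paper's actual code or training procedure.

```python
# Toy sketch of a two-stage audio-visual captioner training schedule.
# All modules, dimensions, and data here are illustrative assumptions.
import torch
import torch.nn as nn

torch.manual_seed(0)

class Encoder(nn.Module):
    """Stand-in for a real audio or visual backbone."""
    def __init__(self, in_dim: int, hid: int = 32):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, hid), nn.ReLU(), nn.Linear(hid, hid))
    def forward(self, x):
        return self.net(x)

class FusionHead(nn.Module):
    """Mixes audio and visual features with a learned balance weight, then decodes."""
    def __init__(self, hid: int = 32, vocab: int = 100):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(0.5))  # audio vs. visual balance
        self.decoder = nn.Linear(hid, vocab)
    def forward(self, a, v):
        return self.decoder(self.alpha * a + (1.0 - self.alpha) * v)

def train(params, step_fn, steps: int = 50, lr: float = 1e-2) -> float:
    """Run a few optimization steps; step_fn returns the current loss."""
    opt = torch.optim.Adam(params, lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        loss = step_fn()
        loss.backward()
        opt.step()
    return loss.item()

# Toy batch: 8 clips with 40-dim audio and 64-dim visual features, and a single
# "caption token" label standing in for a full caption.
audio_x, visual_x = torch.randn(8, 40), torch.randn(8, 64)
labels = torch.randint(0, 100, (8,))
a_enc, v_enc, head = Encoder(40), Encoder(64), FusionHead()
ce = nn.CrossEntropyLoss()

# Stage 1: fit each modality encoder separately (the other modality is zeroed out,
# so each encoder learns from its own signal).
train(a_enc.parameters(), lambda: ce(head(a_enc(audio_x), torch.zeros(8, 32)), labels))
train(v_enc.parameters(), lambda: ce(head(torch.zeros(8, 32), v_enc(visual_x)), labels))

# Stage 2: jointly fine-tune everything so the fusion head learns a balanced mix.
all_params = list(a_enc.parameters()) + list(v_enc.parameters()) + list(head.parameters())
print("final joint loss:", train(all_params, lambda: ce(head(a_enc(audio_x), v_enc(visual_x)), labels)))
```

The point of the sketch is only the schedule: each encoder first fits its own modality, and the joint stage then tunes the shared balance parameter so that neither audio nor visuals dominates the final caption.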
Why it matters?
User-generated videos are everywhere on social media, and AI that can understand and describe them well improves accessibility, search, and content recommendation for millions of videos online.
Abstract
UGC-VideoCap introduces a new benchmark and model for detailed omnimodal captioning of short-form user-generated videos, emphasizing balanced audio-visual integration and using a novel two-stage training strategy.