
IF-VidCap: Can Video Caption Models Follow Instructions?

Shihao Li, Yuanxing Zhang, Jiangtao Wu, Zhide Lei, Yiwen He, Runzhe Wen, Chenxi Liao, Chengkang Jiang, An Ping, Shuo Gao, Suhan Wang, Zhaozhou Bian, Zijun Zhou, Jingyi Xie, Jiayi Zhou, Jing Wang, Yifan Yao, Weihao Xie, Yingshui Tan, Yanghai Wang, Qianqian Xie, Zhaoxiang Zhang

2025-10-22

Summary

This paper introduces a benchmark for testing how well AI models can create captions for videos when given specific instructions, not just general descriptions.

What's the problem?

Current AI models are good at describing *what* is happening in a video, but they aren't very good at following specific requests about *how* to describe it. Existing tests for video captioning mostly check if the description is thorough, and don't really test if the AI can understand and follow directions like 'write a funny caption' or 'focus on the objects in the background'.

What's the solution?

The researchers created a new test called IF-VidCap, built from 1,400 high-quality video-and-instruction samples. This test specifically looks at two things: whether the caption follows the requested format (like length or style) and whether the caption accurately describes the important parts of the video based on the instructions. They then tested over 20 different AI models on this new benchmark.
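To make the two-part scoring idea concrete, here is a minimal, hypothetical sketch of what checking a caption along both dimensions could look like. Everything in it (the function names, the word-limit rule, the keyword matching) is an illustrative assumption, not the benchmark's actual evaluation code; the paper's real evaluation presumably uses far more sophisticated checks, such as an LLM-based judge for content correctness.

```python
import re
from dataclasses import dataclass

@dataclass
class EvalResult:
    format_correct: bool   # did the caption obey the structural constraints?
    content_score: float   # fraction of instruction-relevant points covered

def check_format(caption: str, max_words: int | None = None,
                 required_pattern: str | None = None) -> bool:
    """Rule-based format check: a word limit and an optional style marker."""
    if max_words is not None and len(caption.split()) > max_words:
        return False
    if required_pattern is not None and not re.search(required_pattern, caption):
        return False
    return True

def score_content(caption: str, key_points: list[str]) -> float:
    """Naive content check: fraction of reference key points the caption mentions.
    A real benchmark would likely use a semantic judge, not substring matching."""
    if not key_points:
        return 0.0
    hits = sum(1 for point in key_points if point.lower() in caption.lower())
    return hits / len(key_points)

# Hypothetical instruction: "In at most 20 words, describe the background objects."
caption = "A red bicycle leans against a brick wall while traffic passes in the background."
result = EvalResult(
    format_correct=check_format(caption, max_words=20),
    content_score=score_content(caption, ["bicycle", "brick wall", "traffic"]),
)
print(result)  # EvalResult(format_correct=True, content_score=1.0)
```

The split itself is the useful idea here: format constraints can often be verified with cheap rules like these, while content correctness needs semantic judgment, and keeping the two scores separate makes the benchmark's verdicts easier to interpret.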

Why it matters?

This work is important because it shows that while the best models (still mostly proprietary ones, though open-source models are now close behind) are getting better at following instructions for video captions, there's still a lot of room for improvement. It also suggests that models designed for detailed descriptions aren't necessarily the best at following complex instructions, meaning future AI development should focus on both understanding what's happening *and* being able to respond to specific requests.

Abstract

Although Multimodal Large Language Models (MLLMs) have demonstrated proficiency in video captioning, practical applications require captions that follow specific user instructions rather than generating exhaustive, unconstrained descriptions. Current benchmarks, however, primarily assess descriptive comprehensiveness while largely overlooking instruction-following capabilities. To address this gap, we introduce IF-VidCap, a new benchmark for evaluating controllable video captioning, which contains 1,400 high-quality samples. Distinct from existing video captioning or general instruction-following benchmarks, IF-VidCap incorporates a systematic framework that assesses captions on two dimensions: format correctness and content correctness. Our comprehensive evaluation of over 20 prominent models reveals a nuanced landscape: despite the continued dominance of proprietary models, the performance gap is closing, with top-tier open-source solutions now achieving near-parity. Furthermore, we find that models specialized for dense captioning underperform general-purpose MLLMs on complex instructions, indicating that future work should simultaneously advance both descriptive richness and instruction-following fidelity.