AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu
2025-07-18
Summary
This paper introduces the AnyCap Project, a system for generating captions across multiple media types, including images, videos, and audio. The system lets users control how captions are produced in terms of style, content, and tone.
What's the problem?
Existing captioning systems often cannot adapt their descriptions to user preferences, and current evaluation methods lack precise ways to measure how well captions follow instructions or match a requested style.
What's the solution?
The authors built a framework called AnyCapModel that improves caption controllability without retraining the underlying base model. They also created a large dataset pairing examples with diverse user instructions, plus a new benchmark that evaluates caption quality along both content accuracy and style.
Why does it matter?
This work makes AI captioning more flexible and accurate, helping users obtain descriptions that match their specific needs. That is useful for accessibility, content creation, and better media understanding.
Abstract
The AnyCap Project introduces a framework, dataset, and evaluation protocol to enhance controllability and reliability in multimodal captioning.