AnyCap Project: A Unified Framework, Dataset, and Benchmark for Controllable Omni-modal Captioning
Yiming Ren, Zhiqiang Lin, Yu Li, Gao Meng, Weiyun Wang, Junjie Wang, Zicheng Lin, Jifeng Dai, Yujiu Yang, Wenhai Wang, Ruihang Chu
2025-07-18
Summary
This paper introduces the AnyCap Project, a system for generating captions across multiple media types, including images, videos, and audio. The system lets users control how captions are produced in terms of style, content, and tone.
What's the problem?
Existing captioning systems often cannot adapt their descriptions to user preferences, and current evaluation methods lack precise ways to measure how well captions follow instructions or match a requested style.
What's the solution?
The authors built a framework called AnyCapModel that improves caption controllability without retraining the underlying base model. They also created a large dataset pairing examples with diverse user instructions, plus a new benchmark that evaluates caption quality along both content accuracy and style.
Why does it matter?
This work makes AI captioning more flexible and accurate, helping users obtain descriptions that match their specific needs. That is useful for accessibility, content creation, and better media understanding.
Abstract
The AnyCap Project introduces a framework, dataset, and evaluation protocol to enhance controllability and reliability in multimodal captioning.