ScaleCap: Inference-Time Scalable Image Captioning via Dual-Modality Debiasing
Long Xing, Qidong Huang, Xiaoyi Dong, Pan Zhang, Yuhang Zang, Yuhang Cao, Jinsong Li, Shuangrui Ding, Weiming Zhang, Nenghai Yu, Jiaqi Wang, Feng Wu, Dahua Lin
2025-06-25
Summary
This paper introduces ScaleCap, a method that makes AI-generated image captions more accurate, balanced, and informative by spending extra compute at inference time to enrich the caption and remove biased content as it is generated.
What's the problem?
The problem is that image captioning models suffer from two kinds of bias: a visual (multimodal) bias that makes them describe some parts of an image in rich detail while glossing over others, and a language bias that lets them hallucinate content that is not actually in the image. Existing methods do not effectively correct these biases while the caption is being generated.
What's the solution?
The researchers developed a dual-modality debiasing pipeline that repeatedly improves a caption at inference time. Heuristic question answering tackles the visual side: the system poses content-specific questions about the image and folds the answers back into the caption, progressively enriching parts that were under-described. Contrastive sentence rating tackles the language side: each sentence is scored by contrasting the model's confidence with and without the image, so sentences that rely on language priors rather than visual evidence are filtered out.
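The two steps above can be sketched as a simple enrich-then-filter loop. This is an illustrative toy, not the paper's implementation: the model calls are stubbed with canned answers and a keyword-based score, where a real system would query a vision-language model for the answers and for the with-image vs. without-image sentence likelihoods.

```python
# Illustrative sketch of a ScaleCap-style refinement loop.
# All model behavior below is mocked; names and thresholds are hypothetical.

def ask_model(image, question):
    # Placeholder for querying a vision-language model about the image.
    canned = {
        "What objects are in the background?": "A bicycle leans against a brick wall.",
        "What is the lighting like?": "Warm orange light comes from a streetlamp.",
    }
    return canned.get(question, "")

def sentence_confidence(image, sentence):
    # Placeholder contrastive score: in the real method this would be the
    # model's confidence in the sentence WITH the image minus its confidence
    # WITHOUT it, so sentences driven by language priors alone score low.
    grounded_words = ("man", "street", "bicycle", "streetlamp")
    with_image = 0.9 if any(w in sentence.lower() for w in grounded_words) else 0.3
    without_image = 0.3
    return with_image - without_image

def refine_caption(image, caption, questions, threshold=0.2):
    # Step 1 (enrich): heuristic question answering adds grounded detail.
    sentences = [s.strip() + "." for s in caption.split(".") if s.strip()]
    for q in questions:
        answer = ask_model(image, q)
        if answer:
            sentences.append(answer)
    # Step 2 (debias): contrastive rating drops weakly grounded sentences.
    kept = [s for s in sentences if sentence_confidence(image, s) > threshold]
    return " ".join(kept)

caption = "A man stands on a street. He holds a red umbrella."
questions = ["What objects are in the background?", "What is the lighting like?"]
result = refine_caption(None, caption, questions)
```

In this toy run, the hallucinated umbrella sentence is filtered out while the two grounded answers are merged in, mirroring how enrichment and debiasing work together.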
Why it matters?
This matters because more accurate and complete image captions can help people with visual impairments, improve image search, and make AI systems more trustworthy when they describe what is in an image.
Abstract
ScaleCap enhances image captioning by iteratively enriching and calibrating captions using heuristic question answering and contrastive sentence rating, addressing multimodal and linguistic biases to improve accuracy, balance, and informativeness.