Descriptive Caption Enhancement with Visual Specialists for Multimodal Perception
Yanpeng Sun, Jing Hao, Ke Zhu, Jiang-Jiang Liu, Yuxiang Zhao, Xiaofan Li, Gang Zhang, Zechao Li, Jingdong Wang
2024-12-20

Summary
This paper introduces a new method called DCE (Descriptive Caption Enhancement) that improves how AI describes images by using specialized visual models. It aims to create more detailed and accurate captions that help AI understand images better.
What's the problem?
Current methods for generating image captions typically rely either on descriptions distilled from existing AI models or on captions collected from the internet or written by humans, both of which can miss important details and context. This leads to oversimplified or inaccurate descriptions, making it harder for AI to fully understand the images.
What's the solution?
DCE uses off-the-shelf visual specialists, models trained to analyze images in specific ways, such as recognizing emotions or understanding object relationships. By combining these insights with large language models (LLMs), DCE generates richer and more detailed captions. The approach captures both low-level and fine-grained attributes (like depth, emotion, and fine-grained categories) and relationships between objects (like relative location and human-object interaction), resulting in improved image descriptions.
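To make the general idea concrete, here is a minimal Python sketch of this kind of pipeline: several visual specialists produce structured attributes and relations for an image, and those facts are fused into a prompt for an LLM that writes the final caption. All function names, data structures, and the dummy specialist outputs below are illustrative assumptions, not DCE's actual interfaces or models.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

@dataclass
class ObjectAttributes:
    """Attributes that visual specialists might produce for one detected object."""
    category: str            # fine-grained category, e.g. "golden retriever"
    depth: str               # coarse depth bucket, e.g. "foreground"
    emotion: Optional[str]   # emotion for people/animals, if applicable
    location: str            # relative location, e.g. "left of the bench"

@dataclass
class Interaction:
    """A human-object-interaction (HOI) relation between two entities."""
    subject: str
    verb: str
    obj: str

def run_visual_specialists(image_path: str) -> Tuple[List[ObjectAttributes], List[Interaction]]:
    """Placeholder for running off-the-shelf specialists (detector, depth,
    emotion, HOI models). Real systems would call trained models here; this
    stub returns dummy outputs just to illustrate the data being gathered."""
    objects = [
        ObjectAttributes("young woman", "foreground", "smiling", "center of the image"),
        ObjectAttributes("golden retriever", "foreground", None, "to her right"),
    ]
    interactions = [Interaction("young woman", "petting", "golden retriever")]
    return objects, interactions

def build_caption_prompt(objects: List[ObjectAttributes],
                         interactions: List[Interaction]) -> str:
    """Fuse specialist outputs into an instruction for an LLM that writes the
    final descriptive caption."""
    lines = ["Write one detailed, fluent image caption using these facts:"]
    for o in objects:
        attrs = f"- {o.category}: {o.depth}, {o.location}"
        if o.emotion:
            attrs += f", appears {o.emotion}"
        lines.append(attrs)
    for r in interactions:
        lines.append(f"- interaction: {r.subject} is {r.verb} the {r.obj}")
    return "\n".join(lines)

if __name__ == "__main__":
    objs, rels = run_visual_specialists("example.jpg")
    prompt = build_caption_prompt(objs, rels)
    print(prompt)  # in a full pipeline, this prompt would be sent to an LLM to produce the caption
```

In practice the prompt would be passed to a language model together with any base caption, so the resulting description incorporates the specialists' attribute and relation information rather than relying on the captioning model alone.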
Why it matters?
This research is important because it enhances the ability of AI systems to interpret and describe visual information accurately. Better image captions can improve various applications, such as search engines, social media platforms, and accessibility tools for visually impaired users, making technology more effective and user-friendly.
Abstract
Training Large Multimodality Models (LMMs) relies on descriptive image captions that connect images and language. Existing methods either distill captions from LMMs or construct them from internet images or by human annotators. We propose to leverage off-the-shelf visual specialists, which were initially trained on annotated images for tasks other than image captioning, to enhance the image captions. Our approach, named DCE, explores object low-level and fine-grained attributes (e.g., depth, emotion, and fine-grained categories) and object relations (e.g., relative location and human-object interaction (HOI)), and combines these attributes into the descriptive caption. Experiments demonstrate that such visual specialists improve performance on visual understanding tasks, as well as on reasoning that benefits from more accurate visual understanding. We will release the source code and the pipeline so that other visual specialists can be easily integrated into the pipeline. The complete source code of the DCE pipeline and datasets will be available at https://github.com/syp2ysy/DCE.