RONA: Pragmatically Diverse Image Captioning with Coherence Relations
Aashish Anantha Ramakrishnan, Aadarsh Anantha Ramakrishnan, Dongwon Lee
2025-03-27

Summary
This paper is about improving how AI writes captions for images by making them more diverse and human-like.
What's the problem?
AI-generated image captions often focus on plainly describing what's in the image, whereas human-written captions also use context and pragmatic cues to convey a central message.
What's the solution?
The researchers developed a new prompting method called RONA that uses 'coherence relations', which describe the different ways a caption can relate to its image, to guide the AI toward writing varied captions that still convey a central message alongside the visual description.
Why it matters?
This work matters because it can lead to AI systems that generate more engaging and informative image captions, making them more useful in applications like writing assistants and social media.
Abstract
Writing Assistants (e.g., Grammarly, Microsoft Copilot) traditionally generate diverse image captions by employing syntactic and semantic variations to describe image components. However, human-written captions prioritize conveying a central message alongside visual descriptions using pragmatic cues. To enhance pragmatic diversity, it is essential to explore alternative ways of communicating these messages in conjunction with visual content. To address this challenge, we propose RONA, a novel prompting strategy for Multi-modal Large Language Models (MLLMs) that leverages Coherence Relations as an axis for variation. We demonstrate that RONA generates captions with better overall diversity and ground-truth alignment compared to MLLM baselines across multiple domains. Our code is available at: https://github.com/aashish2000/RONA
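As a rough illustration of what coherence-relation-conditioned prompting could look like, the sketch below builds one prompt per relation and queries an MLLM for each. The relation names (borrowed loosely from the cross-modal coherence literature), the prompt wording, and the `build_prompt` / `call_mllm` helpers are assumptions for illustration only, not RONA's actual taxonomy or prompts; the linked repository contains the real implementation.

```python
# Illustrative sketch only: the relation set and prompt wording below are
# assumptions, not RONA's actual prompts (see the linked repository).

# Hypothetical coherence relations; the paper's taxonomy may differ.
COHERENCE_RELATIONS = {
    "Visible": "Describe only what is directly visible in the image.",
    "Subjective": "Convey an opinion or emotional reaction to the image.",
    "Action": "Focus on the actions or events taking place in the image.",
    "Story": "Situate the image within a broader narrative or backstory.",
    "Meta": "Comment on how or why the image itself was captured.",
}


def build_prompt(relation: str, instruction: str) -> str:
    """Compose a caption request conditioned on one coherence relation."""
    return (
        "Write a caption for the attached image. "
        f"The caption should stand in a '{relation}' coherence relation "
        f"to the image: {instruction} Keep it to one sentence."
    )


def call_mllm(prompt: str, image_path: str) -> str:
    """Placeholder for an MLLM call (e.g., a hosted or open-weights
    vision-language model); replace with a real client in practice."""
    return f"[caption for {image_path} under prompt: {prompt[:40]}...]"


def pragmatically_diverse_captions(image_path: str) -> dict[str, str]:
    """Generate one caption per coherence relation for a single image."""
    return {
        relation: call_mllm(build_prompt(relation, instruction), image_path)
        for relation, instruction in COHERENCE_RELATIONS.items()
    }


if __name__ == "__main__":
    for relation, caption in pragmatically_diverse_captions("photo.jpg").items():
        print(f"{relation}: {caption}")
```

The key idea this sketch tries to capture is that varying the coherence relation, rather than only the wording, is what pushes the captions toward pragmatic rather than merely syntactic or semantic diversity.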