Toward Robust Hyper-Detailed Image Captioning: A Multiagent Approach and Dual Evaluation Metrics for Factuality and Coverage
Saehyung Lee, Seunghyun Yoon, Trung Bui, Jing Shi, Sungroh Yoon
2024-12-23

Summary
This paper focuses on improving how AI generates detailed captions for images, using a multiagent approach to reduce errors known as hallucinations, where the AI describes things that are not actually in the image.
What's the problem?
AI models that create captions for images often make mistakes by 'hallucinating' objects or details that are not present in the image. This happens especially as captions grow longer, because the models increasingly rely on their own previously generated text instead of the actual content of the image. Existing methods for detecting these errors are not very effective on detailed captions.
What's the solution?
The authors propose a new method in which different AI models collaborate (a multiagent approach) to correct and improve captions. They also introduce a new evaluation framework and benchmark dataset to better analyze how accurate and complete these detailed captions are. Their experiments show that the method improves the factual accuracy of captions, even those generated by advanced models like GPT-4V, and reveal that current evaluation methods do not fully capture how well a model can generate detailed captions.
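To make the idea more concrete, here is a minimal, hypothetical sketch of an LLM-MLLM correction loop of the kind described above. It is not the authors' exact pipeline: the `ask_llm` and `ask_mllm` helpers are assumed wrappers around a text-only model and a multimodal model that the caller would supply.

```python
# Hypothetical LLM-MLLM correction loop (illustrative sketch, not the paper's
# exact method): an LLM splits the caption into atomic claims, an MLLM checks
# each claim against the image, and the LLM rewrites the caption using only
# the supported claims.
from typing import Callable, List


def split_into_claims(caption: str, ask_llm: Callable[[str], str]) -> List[str]:
    """Use a text-only LLM to break a detailed caption into atomic claims."""
    prompt = (
        "Split the following image caption into short, atomic factual claims, "
        f"one per line:\n\n{caption}"
    )
    return [line.strip("- ").strip() for line in ask_llm(prompt).splitlines() if line.strip()]


def verify_claim(claim: str, image_path: str,
                 ask_mllm: Callable[[str, str], str]) -> bool:
    """Ask a multimodal LLM whether the image supports a single claim."""
    question = f"Does the image support this statement? Answer yes or no.\n{claim}"
    return ask_mllm(image_path, question).strip().lower().startswith("yes")


def correct_caption(caption: str, image_path: str,
                    ask_llm: Callable[[str], str],
                    ask_mllm: Callable[[str, str], str]) -> str:
    """Rewrite the caption, keeping only claims the MLLM judged as supported."""
    claims = split_into_claims(caption, ask_llm)
    kept = [c for c in claims if verify_claim(c, image_path, ask_mllm)]
    rewrite_prompt = (
        "Rewrite these verified facts as a single fluent image caption:\n"
        + "\n".join(f"- {c}" for c in kept)
    )
    return ask_llm(rewrite_prompt)
```

The key design point this sketch tries to capture is that claim verification is grounded in the image via the multimodal model, rather than letting the caption model judge (and potentially reinforce) its own text.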
Why it matters?
This research is important because it helps make AI-generated captions more accurate and reliable, which is crucial for applications like accessibility tools for visually impaired users, content creation, and social media. By reducing hallucinations, we can improve how AI understands and describes visual content, leading to better user experiences.
Abstract
Multimodal large language models (MLLMs) excel at generating highly detailed captions but often produce hallucinations. Our analysis reveals that existing hallucination detection methods struggle with detailed captions. We attribute this to the increasing reliance of MLLMs on their generated text, rather than the input image, as the sequence length grows. To address this issue, we propose a multiagent approach that leverages LLM-MLLM collaboration to correct given captions. Additionally, we introduce an evaluation framework and a benchmark dataset to facilitate the systematic analysis of detailed captions. Our experiments demonstrate that our proposed evaluation method better aligns with human judgments of factuality than existing metrics and that existing approaches to improve the MLLM factuality may fall short in hyper-detailed image captioning tasks. In contrast, our proposed method significantly enhances the factual accuracy of captions, even improving those generated by GPT-4V. Finally, we highlight a limitation of VQA-centric benchmarking by demonstrating that an MLLM's performance on VQA benchmarks may not correlate with its ability to generate detailed image captions.
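To illustrate the "dual evaluation metrics" idea from the title, the toy functions below score a caption along two axes: factuality (how many of its atomic claims the image supports) and coverage (how many annotated image details the caption mentions). This precision/recall-style formulation is an assumption for illustration, not the paper's exact definition.

```python
# Toy dual metrics (assumed formulation for illustration, not the paper's
# exact definitions). Each caption claim and each annotated reference detail
# is judged True/False by some verifier (human or model).
from typing import List


def factuality(claim_judgments: List[bool]) -> float:
    """Precision-like score: fraction of the caption's atomic claims supported by the image."""
    return sum(claim_judgments) / len(claim_judgments) if claim_judgments else 0.0


def coverage(detail_judgments: List[bool]) -> float:
    """Recall-like score: fraction of annotated image details the caption mentions."""
    return sum(detail_judgments) / len(detail_judgments) if detail_judgments else 0.0


# Example: 3 of 4 claims are supported, and the caption covers 3 of 5 annotated details.
print(factuality([True, True, True, False]))       # 0.75
print(coverage([True, True, True, False, False]))  # 0.6
```

Reporting both numbers matters because a caption can be highly factual simply by saying very little; coverage penalizes that shortcut.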