RICO: Improving Accuracy and Completeness in Image Recaptioning via Visual Reconstruction

Yuchi Wang, Yishuo Cai, Shuhuai Ren, Sihan Yang, Linli Yao, Yuanxin Liu, Yuanxing Zhang, Pengfei Wan, Xu Sun

2025-05-29

Summary

This paper introduces RICO, a framework that helps models write more accurate and complete image captions. It recreates the image from the caption, compares the result to the original, and fixes the mistakes that the comparison reveals.

What's the problem?

When AI describes images with captions, it often misses important details or makes errors, so the captions may not fully match what is actually in the picture. This is a serious issue for people who rely on image captions to understand visual content.

What's the solution?

The researchers created a method where the AI writes a caption, then uses a text-to-image model to recreate the image from that caption alone. If the recreated image doesn't match the original, something is missing or wrong in the caption, so the AI updates the caption and repeats the process to resolve the differences. They also made a faster version called RICO-Flash that uses DPO (Direct Preference Optimization), a preference-based training technique, to make the process more efficient.
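The caption, reconstruct, compare, and revise cycle described above can be sketched as a short loop. This is only an illustrative sketch, not the authors' implementation: the function `refine_caption`, its parameters, and the three callables (`reconstruct`, `find_discrepancies`, `revise`) are hypothetical names, and the fixed `max_rounds` stopping rule is an assumption. In a real system the callables would wrap a text-to-image model and a multimodal model; here toy stand-ins treat an "image" as a set of object names so the loop can run on its own.

```python
from typing import Callable, List


def refine_caption(
    image,
    initial_caption: str,
    reconstruct: Callable[[str], object],
    find_discrepancies: Callable[[object, object, str], List[str]],
    revise: Callable[[str, List[str]], str],
    max_rounds: int = 3,  # assumed stopping rule, not taken from the paper
) -> str:
    """Iteratively refine a caption by reconstructing the image from it."""
    caption = initial_caption
    for _ in range(max_rounds):
        # Recreate the image using only the current caption.
        reconstructed = reconstruct(caption)
        # Compare the reconstruction with the original image.
        issues = find_discrepancies(image, reconstructed, caption)
        if not issues:
            break  # the reconstruction matches, so the caption is kept
        # Update the caption to address the discrepancies that were found.
        caption = revise(caption, issues)
    return caption


# Toy stand-ins (hypothetical): an "image" is just a set of object names.
def toy_reconstruct(caption: str) -> set:
    return set(caption.replace(",", " ").split())


def toy_find_discrepancies(image: set, reconstructed: set, caption: str) -> List[str]:
    missing = sorted(image - reconstructed)
    return missing[:1]  # report one missing object per round


def toy_revise(caption: str, issues: List[str]) -> str:
    return caption + ", " + ", ".join(issues)


original = {"dog", "ball", "grass"}
result = refine_caption(
    original, "dog", toy_reconstruct, toy_find_discrepancies, toy_revise, max_rounds=5
)
print(result)
```

Passing the models in as callables keeps the loop independent of any particular text-to-image or captioning model, which is a design choice of this sketch rather than something the summary specifies.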

Why does it matter?

This matters because it helps make image captions much more accurate and detailed, which is important for things like helping visually impaired people understand pictures, improving search engines, and making digital content more accessible to everyone.

Abstract

A novel iterative framework, RICO, improves the accuracy and completeness of image captions by reconstructing the image from the caption with a text-to-image model and refining the caption to resolve discrepancies with the original, while RICO-Flash improves efficiency using DPO.
