AlignBench: Benchmarking Fine-Grained Image-Text Alignment with Synthetic Image-Caption Pairs
Kuniaki Saito, Risa Shinoda, Shohei Tanaka, Tosho Hirasawa, Fumio Okura, Yoshitaka Ushiku
2025-12-04
Summary
This paper focuses on how well different computer models understand the connection between images and the text that describes them, specifically looking at models like CLIP that try to bridge the gap between vision and language.
What's the problem?
Currently, the ways we test these image-text models aren't very good at catching subtle errors in alignment. Existing tests either tweak examples with simple, rule-based edits or use very short descriptions, so they don't really challenge the models to deeply understand the relationship between what they 'see' and what they 'read'. This means we don't have a reliable way to know if a model *truly* understands the content of an image when it's described with a detailed caption.
What's the solution?
The researchers created a new testing tool called AlignBench. This tool uses detailed image-caption pairs created by other AI models, and then has people check whether each sentence in the caption accurately describes the image. By measuring how well different models can judge the correctness of these captions against the human labels, AlignBench directly evaluates how well those models understand image-text alignment. Testing a wide range of models revealed some surprising weaknesses: CLIP-based models are nearly blind to these fine-grained errors, models tend to over-score the early sentences of a caption, and models favor captions they themselves generated.
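The evaluation protocol described above can be sketched in a few lines. This is a hypothetical illustration, not the authors' actual code: the data structures, the `judge_sentence` stub (which stands in for a real VLM call), and the toy example are all assumptions made for clarity.

```python
# Hypothetical sketch of an AlignBench-style evaluation loop.
# All names and data here are illustrative, not from the paper's code.
from dataclasses import dataclass

@dataclass
class AnnotatedCaption:
    sentences: list      # the generated caption, split into sentences
    human_labels: list   # per-sentence: True if it matches the image

def judge_sentence(image_id: str, sentence: str) -> bool:
    # Placeholder for a VLM call that answers "does this sentence
    # correctly describe the image?" -- here, a trivial stub heuristic.
    return "purple" not in sentence

def evaluate(dataset: dict) -> float:
    """Fraction of sentences where the model's verdict agrees with humans."""
    correct = total = 0
    for image_id, ann in dataset.items():
        for sent, label in zip(ann.sentences, ann.human_labels):
            correct += (judge_sentence(image_id, sent) == label)
            total += 1
    return correct / total

# Toy example: one image, a two-sentence caption, one hallucinated detail.
data = {
    "img_001": AnnotatedCaption(
        sentences=["A dog sits on grass.", "The dog wears a purple hat."],
        human_labels=[True, False],
    ),
}
print(evaluate(data))  # 1.0 -- the stub judge agrees with both labels
```

The key design point is that scoring happens per sentence rather than per caption, which is what lets the benchmark localize exactly which claims a model fails to verify.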
Why it matters?
This research is important because accurately assessing image-text alignment is crucial for building AI systems that can truly understand and interact with the world around them. The findings reveal that even advanced models struggle with this task, and that models often have biases, like preferring captions they themselves generated. This work helps point the way towards building better, more reliable AI systems that can connect images and language in a meaningful way.
Abstract
Assessing image-text alignment models such as CLIP is crucial for bridging visual and linguistic representations. Yet existing benchmarks rely on rule-based perturbations or short captions, limiting their ability to measure fine-grained alignment. We introduce AlignBench, a benchmark that provides a new indicator of image-text alignment by evaluating detailed image-caption pairs generated by diverse image-to-text and text-to-image models. Each sentence is annotated for correctness, enabling direct assessment of VLMs as alignment evaluators. Benchmarking a wide range of decoder-based VLMs reveals three key findings: (i) CLIP-based models, even those tailored for compositional reasoning, remain nearly blind; (ii) detectors systematically over-score early sentences; and (iii) they show strong self-preference, favoring their own outputs and harming detection performance. Our project page will be available at https://dahlian00.github.io/AlignBench/.