A High-Quality Dataset and Reliable Evaluation for Interleaved Image-Text Generation

Yukang Feng, Jianwen Sun, Chuanhao Li, Zizhen Li, Jiaxin Ai, Fanrui Zhang, Yifan Chang, Sizhuo Zhou, Shenglin Zhang, Yu Dai, Kaipeng Zhang

2025-06-16


Summary

This paper introduces InterSyn, a large, high-quality dataset for training AI models to generate interleaved image and text outputs. It contains many conversations in which images and text are closely linked, and an automated process called SEIR refines them to keep quality high. The paper also introduces SynJudge, a tool that automatically evaluates how well a model combines images and text, focusing on how well the two complement each other rather than on surface similarity alone.

What's the problem?

Existing AI models struggle to produce image and text outputs that are tightly connected and coherent, mainly because there is not enough high-quality data showing how images and text should interact. Current evaluation methods also fail to measure how well images and text complement each other, which makes it hard to improve models in this area.

What's the solution?

The authors built InterSyn with a fully automated method called Self-Evaluation with Iterative Refinement (SEIR), which refines questions, answers, and images over multiple steps to produce detailed, coherent interleaved image-text dialogues. They also designed SynJudge, an automatic evaluator that judges model outputs on four aspects, including a measure of how well the image and text parts work together to convey meaning.
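As a rough illustration of the evaluate-then-refine idea behind SEIR, the sketch below runs a generic loop that critiques a draft and revises it until the critique passes or a round limit is reached. It is not the authors' implementation: the names `Critique`, `refine_until_accepted`, `toy_evaluate`, and `toy_revise`, the `max_rounds` limit, and the toy acceptance rule are all assumptions made for this example. In a real pipeline the evaluator and reviser would be model calls.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass
class Critique:
    """Result of one self-evaluation pass (hypothetical structure)."""
    acceptable: bool
    feedback: str


def refine_until_accepted(
    draft: str,
    evaluate: Callable[[str], Critique],
    revise: Callable[[str, str], str],
    max_rounds: int = 3,
) -> Tuple[str, int]:
    """Evaluate a draft and revise it until accepted or max_rounds is hit.

    Returns the final draft and the number of revisions applied.
    """
    for rounds in range(max_rounds):
        critique = evaluate(draft)
        if critique.acceptable:
            return draft, rounds
        draft = revise(draft, critique.feedback)
    return draft, max_rounds


# Toy stand-ins for model calls (assumptions, for illustration only).
def toy_evaluate(text: str) -> Critique:
    # Accept an answer only if it explicitly ties itself to the image.
    if "image" in text:
        return Critique(True, "")
    return Critique(False, "refer to the image")


def toy_revise(text: str, feedback: str) -> str:
    # A real reviser would use the feedback; this one appends a fixed pointer.
    return text + " (see image)"


final, rounds = refine_until_accepted(
    "A red fox in snow.", toy_evaluate, toy_revise
)
print(final, rounds)
```

The same loop could be applied in turn to a question, its answer, and the image description, which is one way to read the "multiple steps" described above.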

Why does it matter?

This matters because better data and evaluation tools help AI models understand and generate content that mixes images and text smoothly and meaningfully. That can improve applications such as chatbots, virtual assistants, and other systems that need to communicate with both pictures and words.

Abstract

InterSyn, a large-scale dataset with tightly interleaved image-text outputs and automated quality refinement, improves multimodal understanding and generation through the SEIR method and SynJudge, an automatic evaluation tool.
