Zebra-CoT: A Dataset for Interleaved Vision Language Reasoning

Ang Li, Charles Wang, Kaiyu Yue, Zikui Cai, Ollie Liu, Deqing Fu, Peng Guo, Wang Bill Zhu, Vatsal Sharan, Robin Jia, Willie Neiswanger, Furong Huang, Tom Goldstein, Micah Goldblum

2025-07-23

Summary

This paper introduces Zebra-CoT, a large-scale dataset designed to help AI models reason jointly over vision and language by training on many different types of tasks that interleave images and text.

What's the problem?

AI models often struggle to reason through problems step by step when they must combine understanding of both images and language, which limits how clearly they can explain their thinking and how accurately they can make decisions.

What's the solution?

The researchers built Zebra-CoT from a variety of tasks that interleave images and text, encouraging models to generate a visual chain of thought: a reasoning trace that mixes textual steps with intermediate visual sketches as the model works toward an answer. Fine-tuning models on this dataset improves their accuracy and reasoning ability.
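To make the idea of an interleaved visual chain of thought concrete, here is a minimal sketch of what one training sample might look like. The actual Zebra-CoT schema is not described in this summary, so the field names and structure below are purely illustrative assumptions.

```python
# Hypothetical interleaved vision-language chain-of-thought sample.
# Field names ("question", "reasoning", "answer") and the text/image
# step structure are illustrative assumptions, not the real schema.
sample = {
    "question": "Which shape completes the visual pattern?",
    "reasoning": [
        {"type": "text",  "content": "Step 1: examine the grid of shapes."},
        {"type": "image", "content": "<sketch of the partially filled grid>"},
        {"type": "text",  "content": "Step 2: each row alternates circle and square."},
        {"type": "image", "content": "<sketch with the predicted shape drawn in>"},
    ],
    "answer": "square",
}

def count_modalities(example):
    """Count how many text and image steps appear in a reasoning trace."""
    counts = {"text": 0, "image": 0}
    for step in example["reasoning"]:
        counts[step["type"]] += 1
    return counts

print(count_modalities(sample))
```

The key point the sketch illustrates is that the reasoning trace is not text-only: image steps sit between text steps, so a model trained on such data learns to produce visual intermediate states alongside its verbal reasoning.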

Why it matters?

This matters because better vision-language reasoning allows AI to understand and explain complex real-world situations more clearly, improving applications in education, robotics, and any area where AI must interpret images and language together.

Abstract

Zebra-CoT, a large-scale dataset of diverse visual and textual reasoning tasks, improves multimodal model performance through fine-tuning, enhancing both accuracy and visual chain-of-thought generation.