WEAVE: Unleashing and Benchmarking the In-context Interleaved Comprehension and Generation

Wei Chow, Jiachun Pan, Yongyuan Liang, Mingze Zhou, Xue Song, Liyu Jia, Saining Zhang, Siliang Tang, Juncheng Li, Fengda Zhang, Weijia Wu, Hanwang Zhang, Tat-Seng Chua

2025-11-17

Summary

This paper introduces WEAVE, a new resource designed to test and improve how well artificial intelligence models can understand and create images based on ongoing conversations.

What's the problem?

Current AI models that work with both images and text are really good at responding to single requests, like 'describe this picture' or 'edit this image to add a cat'. However, real-world interactions are rarely just one question and answer; they're usually a series of back-and-forth exchanges where the AI needs to remember what was said before to understand the current request. Existing datasets don't really challenge AI in this way, so it's hard to build models that can handle these more complex, multi-turn conversations about images.

What's the solution?

The researchers created WEAVE, which has two main parts. First, WEAVE-100k is a large collection of 100,000 examples of conversations about images, totaling over 370,000 individual turns and 500,000 images. These conversations involve understanding images, editing them, and creating new ones, all while keeping track of the previous discussion. Second, WEAVEBench is a human-annotated benchmark of 100 tasks built from 480 images that rigorously tests how well models perform on these multi-turn image tasks, probing skills like visual memory and reasoning. They also developed a hybrid judging method that uses a vision-language model to score each response against both a reference image and the combination of the original image with the editing instructions.
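To make the multi-turn, context-dependent structure concrete, here is a minimal sketch of what one interleaved sample and its context assembly might look like. The field names (`turns`, `role`, `text`, `images`) and filenames are illustrative assumptions, not WEAVE's actual schema.

```python
# Hypothetical sketch of a multi-turn interleaved sample.
# Field names and filenames are illustrative assumptions,
# not the real WEAVE-100k schema.
sample = {
    "turns": [
        {"role": "user", "text": "Describe this photo.", "images": ["park.jpg"]},
        {"role": "model", "text": "A dog playing in a park.", "images": []},
        {"role": "user", "text": "Now add a red ball next to the dog.", "images": []},
        {"role": "model", "text": "Here is the edited image.", "images": ["park_ball.jpg"]},
    ]
}

def build_context(sample, upto):
    """Collect all text and images from turns before index `upto`.

    A model must carry this history forward to resolve references
    like "the dog", which only appear in earlier turns.
    """
    history = sample["turns"][:upto]
    texts = [turn["text"] for turn in history]
    images = [img for turn in history for img in turn["images"]]
    return texts, images
```

For example, `build_context(sample, 3)` gathers the first three turns' text plus the image `park.jpg`; without that accumulated context, the third request ("add a red ball next to the dog") is ambiguous.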

Why it matters?

WEAVE is important because it provides a realistic way to evaluate and improve AI models that deal with images and text. By forcing models to handle ongoing conversations, it pushes them to develop better memory and reasoning skills, ultimately leading to AI that can more naturally and effectively interact with humans about visual content. It also highlights where current AI still struggles, paving the way for future research in this area.

Abstract

Recent advances in unified multimodal models (UMMs) have enabled impressive progress in visual comprehension and generation. However, existing datasets and benchmarks focus primarily on single-turn interactions, failing to capture the multi-turn, context-dependent nature of real-world image creation and editing. To address this gap, we present WEAVE, the first suite for in-context interleaved cross-modality comprehension and generation. Our suite consists of two complementary parts. WEAVE-100k is a large-scale dataset of 100K interleaved samples spanning over 370K dialogue turns and 500K images, covering comprehension, editing, and generation tasks that require reasoning over historical context. WEAVEBench is a human-annotated benchmark with 100 tasks based on 480 images, featuring a hybrid VLM-judger evaluation framework that scores outputs against both the reference image and the combination of the original image with the editing instructions, assessing models' abilities in multi-turn generation, visual memory, and world-knowledge reasoning across diverse domains. Experiments demonstrate that training on WEAVE-100k strengthens vision comprehension, image editing, and comprehension-generation collaboration capabilities. Furthermore, it helps UMMs develop emergent visual-memory capabilities, while extensive evaluations on WEAVEBench expose the persistent limitations and challenges of current approaches to multi-turn, context-aware image generation and editing. We believe WEAVE provides a perspective and foundation for studying in-context interleaved comprehension and generation in the multimodal community.