MACRO: Advancing Multi-Reference Image Generation with Structured Long-Context Data
Zhekai Chen, Yuqing Wang, Manyuan Zhang, Xihui Liu
2026-03-28
Summary
This paper focuses on improving how well computers can create images based on multiple example images provided as guidance, like combining different subjects into one scene or showing different viewpoints of the same object.
What's the problem?
Currently, image generation models struggle when you give them many example images to work with. The main issue is that the datasets used to train these models lack examples with many reference images, so the models have never learned how to understand and combine information from several sources at once. They're good with one or two examples, but fall apart with more.
What's the solution?
The researchers created a new, large dataset called MacroData, containing 400,000 training samples, each with up to 10 reference images. This dataset is designed to cover four types of multi-image tasks: customizing images, illustrating stories, understanding spatial relationships, and showing changes over time. They also created MacroBench, a standardized way to test how well models perform on these tasks. By training models on MacroData, they showed significant improvements in generating images from multiple references.
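To make the dataset structure concrete, here is a minimal sketch of what one MacroData-style sample might look like, assuming a simple record with a task label, a text prompt, and up to 10 reference images. The field names and validation logic are illustrative assumptions, not the released dataset's actual schema.

```python
from dataclasses import dataclass

# Hypothetical schema for one multi-reference sample; names are
# illustrative, not taken from the MacroData release.
TASK_DIMENSIONS = {"customization", "illustration", "spatial", "temporal"}
MAX_REFERENCES = 10  # the paper states each sample has up to 10 references


@dataclass
class MacroSample:
    task: str                   # one of the four task dimensions
    prompt: str                 # instruction tying the references together
    reference_paths: list[str]  # 1..10 reference image paths
    target_path: str            # ground-truth output image

    def __post_init__(self) -> None:
        # Enforce the two structural constraints described in the paper.
        if self.task not in TASK_DIMENSIONS:
            raise ValueError(f"unknown task dimension: {self.task}")
        if not 1 <= len(self.reference_paths) <= MAX_REFERENCES:
            raise ValueError("expected between 1 and 10 reference images")


# Example: a two-reference customization sample.
sample = MacroSample(
    task="customization",
    prompt="Place the subject from reference 1 into the scene from reference 2.",
    reference_paths=["subject.png", "scene.png"],
    target_path="composite.png",
)
```

A record like this makes the "graded input scales" idea testable: a benchmark can bucket samples by `len(reference_paths)` and report how quality degrades as references grow.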
Why it matters?
This work is important because it addresses a key limitation in current image generation technology. Being able to effectively use multiple reference images opens up possibilities for more complex and realistic image creation, which is crucial for applications like creating detailed illustrations, designing scenes with many objects, and generating views of objects from different angles. The new dataset and benchmark will help push the field forward and allow for better evaluation of these models.
Abstract
Generating images conditioned on multiple visual references is critical for real-world applications such as multi-subject composition, narrative illustration, and novel view synthesis, yet current models suffer from severe performance degradation as the number of input references grows. We identify the root cause as a fundamental data bottleneck: existing datasets are dominated by single- or few-reference pairs and lack the structured, long-context supervision needed to learn dense inter-reference dependencies. To address this, we introduce MacroData, a large-scale dataset of 400K samples, each containing up to 10 reference images, systematically organized across four complementary dimensions -- Customization, Illustration, Spatial reasoning, and Temporal dynamics -- to provide comprehensive coverage of the multi-reference generation space. Recognizing the concurrent absence of standardized evaluation protocols, we further propose MacroBench, a benchmark of 4,000 samples that assesses generative coherence across graded task dimensions and input scales. Extensive experiments show that fine-tuning on MacroData yields substantial improvements in multi-reference generation, and ablation studies further reveal synergistic benefits of cross-task co-training and effective strategies for handling long-context complexity. The dataset and benchmark will be publicly released.