MultiBanana: A Challenging Benchmark for Multi-Reference Text-to-Image Generation

Yuta Oshima, Daiki Miyake, Kohsei Matsutani, Yusuke Iwasawa, Masahiro Suzuki, Yutaka Matsuo, Hiroki Furuta

2025-12-02

Summary

This paper introduces a new way to test how well AI models can create images from a text description together with multiple example pictures, focusing on how they combine different visual elements and follow instructions.

What's the problem?

Current benchmarks for text-to-image AI models usually give them only one or a few example images to work with, which doesn't show how they handle more complex situations where they must blend information from many different sources. These tests also don't clearly define *what* makes combining multiple images difficult: factors such as whether the images have very different styles, depict objects at different scales, or contain unusual objects aren't properly evaluated.

What's the solution?

The researchers created a new dataset called MultiBanana specifically designed to challenge AI models with these complex multi-image scenarios. MultiBanana includes variations like using different numbers of example images, mixing images of different styles (like photos and anime), dealing with images where the objects are at different scales, using images with rare or unusual things in them, and even using descriptions in multiple languages. They then tested several AI models on this dataset to see how they performed.
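To make the five variation axes concrete, here is a minimal sketch of what one MultiBanana-style benchmark entry might look like. This is an illustrative assumption, not the dataset's actual schema: the class name, field names, and the `difficulty_axes` helper are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical sketch of a multi-reference benchmark entry; the field names
# are illustrative, not MultiBanana's real data format.
@dataclass
class BenchmarkEntry:
    prompt: str                  # target-scene instruction
    prompt_language: str         # e.g. "en", "ja" (multilingual axis)
    reference_images: List[str]  # image paths; length varies (number-of-references axis)
    reference_domains: List[str] # e.g. ["photo", "anime"] (domain-mismatch axis)
    has_scale_mismatch: bool     # reference vs. target scene at different scales
    has_rare_concept: bool       # e.g. "a red banana"

def difficulty_axes(entry: BenchmarkEntry) -> List[str]:
    """List which multi-reference difficulty axes an entry exercises."""
    axes = []
    if len(entry.reference_images) > 2:
        axes.append("many-references")
    if len(set(entry.reference_domains)) > 1:
        axes.append("domain-mismatch")
    if entry.has_scale_mismatch:
        axes.append("scale-mismatch")
    if entry.has_rare_concept:
        axes.append("rare-concept")
    if entry.prompt_language != "en":
        axes.append("multilingual")
    return axes

entry = BenchmarkEntry(
    prompt="Place the cat from image 1 into the street scene of image 2",
    prompt_language="en",
    reference_images=["cat.png", "street.png", "style.png"],
    reference_domains=["photo", "photo", "anime"],
    has_scale_mismatch=True,
    has_rare_concept=False,
)
print(difficulty_axes(entry))  # -> ['many-references', 'domain-mismatch', 'scale-mismatch']
```

Tagging each sample with the axes it exercises is what lets a benchmark like this report *where* a model fails (e.g., only on domain-mismatched references), rather than a single aggregate score.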

Why it matters?

This new dataset, MultiBanana, is important because it provides a standardized and more difficult way to measure the progress of text-to-image AI models. It helps pinpoint where these models struggle and what areas need improvement, ultimately pushing the field forward and allowing for fairer comparisons between different models.

Abstract

Recent text-to-image generation models have acquired the ability of multi-reference generation and editing: the ability to inherit the appearance of subjects from multiple reference images and re-render them in new contexts. However, existing benchmark datasets often focus on generation with a single or a few reference images, which prevents us from measuring how model performance advances, or from pointing out weaknesses, under different multi-reference conditions. In addition, their task definitions are still vague, typically limited to axes such as "what to edit" or "how many references are given", and therefore fail to capture the intrinsic difficulty of multi-reference settings. To address this gap, we introduce MultiBanana, which is carefully designed to assess the edge of model capabilities by widely covering multi-reference-specific problems at scale: (1) varying the number of references, (2) domain mismatch among references (e.g., photo vs. anime), (3) scale mismatch between reference and target scenes, (4) references containing rare concepts (e.g., a red banana), and (5) multilingual textual references for rendering. Our analysis across a variety of text-to-image models reveals where they perform well, their typical failure modes, and areas for improvement. MultiBanana will be released as an open benchmark to push the boundaries and establish a standardized basis for fair comparison in multi-reference image generation. Our data and code are available at https://github.com/matsuolab/multibanana .