
MultiRef: Controllable Image Generation with Multiple Visual References

Ruoxi Chen, Dongping Chen, Siyuan Wu, Sinan Wang, Shiyun Lang, Petr Sushko, Gaoyang Jiang, Yao Wan, Ranjay Krishna

2025-08-20


Summary

This paper looks at generating images from more than one reference image, which is how real designers actually work, whereas current AI systems usually accept only a single reference. The authors built a new way to test these systems and found that even the best ones handle multiple references poorly, and their work points the way toward better creative AI.

What's the problem?

Current AI tools for creating images typically let you give them only one text description or one example picture to work from. Human artists and designers, however, often pull ideas from many different sources and combine them into something new. This limitation makes AI tools far less flexible and creative than human designers.

What's the solution?

To address this, the researchers developed a new evaluation framework called MultiRef-bench to properly test AI systems that generate images from multiple visual references. They also built a data engine called RefBlend to synthesize test samples, and used it to construct a larger dataset, MultiRef, containing 38,000 high-quality images. They then evaluated several popular image generation systems with these resources and found that all of them struggled to combine information from multiple references.
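To make the evaluation idea concrete, here is a minimal Python sketch of what a multi-reference benchmark loop might look like. All names here (Sample, MultiRefModel, the score callable) are hypothetical placeholders for illustration, not the actual MultiRef-bench or RefBlend API; the paper's real evaluation compares each generated image against a golden answer.

```python
from dataclasses import dataclass
from typing import Callable, List, Protocol


@dataclass
class Sample:
    """One hypothetical benchmark item: instruction, reference images, and a target."""
    instruction: str        # text describing how the references should be combined
    references: List[str]   # paths to reference images (e.g., sketch, depth map, subject photo)
    golden: str             # path to the ground-truth target image


class MultiRefModel(Protocol):
    """Placeholder interface for any multi-reference image generator."""
    def generate(self, instruction: str, references: List[str]) -> str:
        """Return the path of the generated image."""
        ...


def evaluate(model: MultiRefModel,
             samples: List[Sample],
             score: Callable[[str, str], float]) -> float:
    """Average the score of each generated image against its golden answer."""
    totals = [score(model.generate(s.instruction, s.references), s.golden)
              for s in samples]
    return sum(totals) / len(totals)
```

In a real run, the model slot would be filled by one of the systems the paper tests (such as OmniGen, ACE, Show-o, or an agentic pipeline), and the score function by an image-similarity or judge-based metric; those details are the paper's, not this sketch's.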

Why it matters?

This research is important because it highlights a gap in current AI capabilities for creative tasks. By showing that AI struggles with multi-reference image generation, it gives future development clear targets. The goal is AI tools that work more like human designers, blending inspiration from many sources to produce more nuanced and sophisticated artwork.

Abstract

Visual designers naturally draw inspiration from multiple visual references, combining diverse elements and aesthetic principles to create artwork. However, current image generative frameworks predominantly rely on single-source inputs -- either text prompts or individual reference images. In this paper, we focus on the task of controllable image generation using multiple visual references. We introduce MultiRef-bench, a rigorous evaluation framework comprising 990 synthetic and 1,000 real-world samples that require incorporating visual content from multiple reference images. The synthetic samples are generated by our data engine RefBlend, covering 10 reference types and 33 reference combinations. Based on RefBlend, we further construct a dataset MultiRef containing 38k high-quality images to facilitate further research. Our experiments across three interleaved image-text models (i.e., OmniGen, ACE, and Show-o) and six agentic frameworks (e.g., ChatDiT and LLM + SD) reveal that even state-of-the-art systems struggle with multi-reference conditioning, with the best model OmniGen achieving only 66.6% on synthetic samples and 79.0% on real-world cases on average compared to the golden answer. These findings provide valuable directions for developing more flexible and human-like creative tools that can effectively integrate multiple sources of visual inspiration. The dataset is publicly available at: https://multiref.github.io/.