Pico-Banana-400K: A Large-Scale Dataset for Text-Guided Image Editing
Yusu Qian, Eli Bocek-Rivele, Liangchen Song, Jialing Tong, Yinfei Yang, Jiasen Lu, Wenze Hu, Zhe Gan
2025-10-23
Summary
This paper introduces Pico-Banana-400K, a new dataset designed to help researchers improve how computers edit images based on text instructions.
What's the problem?
Current AI models, such as GPT-4o and Nano-Banana, are getting good at editing images from text prompts, but progress is limited because there aren't enough large, high-quality datasets of *real* images available for training and testing these models. Existing datasets are often built from artificially generated images and don't fully represent the complexities of real-world photos and editing tasks.
What's the solution?
The researchers created Pico-Banana-400K by starting with a large collection of real photographs and then using the Nano-Banana AI model to generate pairs of images showing edits. They didn't just randomly generate edits, though. They carefully categorized edits using a fine-grained taxonomy so the dataset covers a wide range of edit types, and they used a multimodal AI model to score the quality of each edit, keeping only changes that preserved the original content and followed the instructions faithfully. They also created special subsets of the data for more complex editing scenarios, like making multiple edits in a sequence or comparing different editing results.
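The curation loop described above can be sketched roughly as follows. This is an illustrative Python sketch, not the authors' actual pipeline: the taxonomy entries, function names, and the 0.8 score threshold are all assumptions, and the model calls are replaced with stand-in stubs.

```python
# Hypothetical sketch of a generate-then-filter curation loop.
# Taxonomy entries, threshold, and stub behavior are illustrative assumptions.
from dataclasses import dataclass
import random

# A few plausible edit categories; the paper's taxonomy is far more fine-grained.
EDIT_TAXONOMY = [
    "object addition", "object removal", "color change",
    "style transfer", "background swap",
]

@dataclass
class EditPair:
    source_image: str
    instruction: str
    edit_type: str
    edited_image: str
    quality_score: float

def generate_edit(source_image: str, edit_type: str) -> tuple[str, str]:
    """Stand-in for calling the Nano-Banana editing model."""
    instruction = f"Apply {edit_type} to the photo"
    return instruction, f"edited_{source_image}"

def score_edit(source: str, edited: str, instruction: str) -> float:
    """Stand-in for MLLM-based quality scoring
    (instruction faithfulness + content preservation)."""
    return random.random()

def curate(images: list[str], threshold: float = 0.8, seed: int = 0) -> list[EditPair]:
    random.seed(seed)
    kept = []
    for img in images:
        edit_type = random.choice(EDIT_TAXONOMY)
        instruction, edited = generate_edit(img, edit_type)
        score = score_edit(img, edited, instruction)
        if score >= threshold:  # keep only high-quality pairs
            kept.append(EditPair(img, instruction, edit_type, edited, score))
    return kept

pairs = curate([f"img_{i}.jpg" for i in range(1000)])
print(f"kept {len(pairs)} of 1000 candidate pairs")
```

The key design point is that generation is cheap relative to curation: many candidate edits are produced, and the quality filter decides what survives into the released dataset.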
Why it matters?
Pico-Banana-400K provides a valuable resource for the AI community. Having a large, high-quality dataset of real images with detailed editing instructions will allow researchers to train and evaluate more advanced image editing models, ultimately leading to better and more reliable AI tools for image manipulation and creation.
Abstract
Recent advances in multimodal models have demonstrated remarkable text-guided image editing capabilities, with systems like GPT-4o and Nano-Banana setting new benchmarks. However, the research community's progress remains constrained by the absence of large-scale, high-quality, and openly accessible datasets built from real images. We introduce Pico-Banana-400K, a comprehensive 400K-image dataset for instruction-based image editing. Our dataset is constructed by leveraging Nano-Banana to generate diverse edit pairs from real photographs in the OpenImages collection. What distinguishes Pico-Banana-400K from previous synthetic datasets is our systematic approach to quality and diversity. We employ a fine-grained image editing taxonomy to ensure comprehensive coverage of edit types while maintaining precise content preservation and instruction faithfulness through MLLM-based quality scoring and careful curation. Beyond single-turn editing, Pico-Banana-400K enables research into complex editing scenarios. The dataset includes three specialized subsets: (1) a 72K-example multi-turn collection for studying sequential editing, reasoning, and planning across consecutive modifications; (2) a 56K-example preference subset for alignment research and reward model training; and (3) paired long-short editing instructions for developing instruction rewriting and summarization capabilities. By providing this large-scale, high-quality, and task-rich resource, Pico-Banana-400K establishes a robust foundation for training and benchmarking the next generation of text-guided image editing models.
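To make the three specialized subsets concrete, the records might look like the following. These field names and values are illustrative assumptions only; they are not the released schema.

```python
# Hypothetical record layouts for the three subsets described in the abstract.
# Field names and file paths are illustrative assumptions, not the real schema.

# (1) Multi-turn subset: one source image edited through consecutive turns.
multi_turn_example = {
    "source_image": "openimages/0001.jpg",
    "turns": [
        {"instruction": "remove the parked car", "edited_image": "0001_t1.jpg"},
        {"instruction": "make the scene nighttime", "edited_image": "0001_t2.jpg"},
    ],
}

# (2) Preference subset: two candidate edits ranked for reward-model training.
preference_example = {
    "source_image": "openimages/0002.jpg",
    "instruction": "turn the sky pink",
    "preferred_edit": "0002_a.jpg",
    "rejected_edit": "0002_b.jpg",
}

# (3) Long-short pairs: the same edit expressed verbosely and concisely,
# for instruction rewriting and summarization research.
instruction_pair_example = {
    "long_instruction": (
        "Repaint the wooden fence in the foreground a deep blue, "
        "keeping the lighting and surrounding grass unchanged"
    ),
    "short_instruction": "repaint the fence blue",
}
```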