Agent Banana: High-Fidelity Image Editing with Agentic Thinking and Tooling
Ruijie Ye, Jiayi Zhang, Zhuoxin Liu, Zihao Zhu, Siyuan Yang, Li Li, Tianfu Fu, Franck Dernoncourt, Yue Zhao, Jiacheng Zhu, Ryan Rossi, Wenhao Chai, Zhengzhong Tu
2026-02-11
Summary
This paper focuses on making instruction-based AI image editing more useful for professionals such as graphic designers and photographers. It tackles the issues that currently keep AI from being a reliable tool in their workflows.
What's the problem?
Current AI image editing tools struggle in three areas. First, they often over-edit, changing more of the image than the user asked for. Second, most tools handle only one instruction at a time, and applying several edits in a row can degrade the details of objects in the image. Finally, these tools are usually evaluated at around 1K resolution, while professionals often work with ultra-high-definition images such as 4K, so the evaluation does not reflect real workflows.
What's the solution?
The researchers built Agent Banana, a planner-executor agent that relies on two main mechanisms. The first, 'Context Folding', compresses the history of earlier instructions into a structured memory so that later edits stay consistent with what has already been done. The second, 'Image Layer Decomposition', edits the image in layers, changing only the regions that need to change and leaving the rest untouched, which lets the system produce output at the image's native resolution without losing quality. They also built a new benchmark, HDD-Bench, made up of native 4K images paired with multi-turn editing dialogues whose targets can be verified step by step.
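To make the idea of Context Folding more concrete, here is a minimal Python sketch of folding older dialogue turns into a structured memory while keeping the most recent turns verbatim. It is an illustration only, not the paper's implementation: the `FoldedMemory` schema, the `fold_context` function, the `keep_recent` parameter, and the rule of keeping only the latest instruction per object are all assumptions made for this example. A real system would more likely use a learned or LLM-based summarizer.

```python
from dataclasses import dataclass, field


@dataclass
class FoldedMemory:
    """Structured summary of older turns (hypothetical schema)."""
    object_states: dict = field(default_factory=dict)  # object -> latest instruction
    folded_turns: int = 0


def fold_context(history, keep_recent=2):
    """Fold all but the last `keep_recent` turns into a FoldedMemory.

    `history` is a list of (target_object, instruction) pairs, oldest first.
    Returns the folded memory and the turns that were kept verbatim.
    """
    cut = max(len(history) - keep_recent, 0)
    memory = FoldedMemory()
    for target, instruction in history[:cut]:
        # Crude stand-in for summarization: a later instruction on the
        # same object overwrites the earlier one.
        memory.object_states[target] = instruction
        memory.folded_turns += 1
    return memory, history[cut:]


history = [
    ("sky", "make the sky overcast"),
    ("car", "paint the car red"),
    ("sky", "add a sunset glow to the sky"),
    ("sign", "remove the street sign"),
    ("car", "add a roof rack to the car"),
]
memory, recent = fold_context(history, keep_recent=2)
print(memory.object_states)
print(recent)
```

The planner in this sketch would receive `memory` plus `recent` instead of the full history, so the prompt size stays bounded by the number of distinct objects rather than the number of turns. Overwriting per object is lossy, which is why it should be read as a placeholder for a proper summarization step.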
Why does it matter?
This work is important because it moves AI image editing closer to being a practical tool for professionals. By addressing the issues of over-editing, handling multiple instructions, and working with high-resolution images, Agent Banana represents a step towards more reliable and useful AI assistance in creative workflows.
Abstract
We study instruction-based image editing under professional workflows and identify three persistent challenges: (i) editors often over-edit, modifying content beyond the user's intent; (ii) existing models are largely single-turn, while multi-turn edits can alter object faithfulness; and (iii) evaluation at around 1K resolution is misaligned with real workflows that often operate on ultra high-definition images (e.g., 4K). We propose Agent Banana, a hierarchical agentic planner-executor framework for high-fidelity, object-aware, deliberative editing. Agent Banana introduces two key mechanisms: (1) Context Folding, which compresses long interaction histories into structured memory for stable long-horizon control; and (2) Image Layer Decomposition, which performs localized layer-based edits to preserve non-target regions while enabling native-resolution outputs. To support rigorous evaluation, we build HDD-Bench, a high-definition, dialogue-based benchmark featuring verifiable stepwise targets and native 4K images (11.8M pixels) for diagnosing long-horizon failures. On HDD-Bench, Agent Banana achieves the best multi-turn consistency and background fidelity (e.g., IC 0.871, SSIM-OM 0.84, LPIPS-OM 0.12) while remaining competitive on instruction following, and also attains strong performance on standard single-turn editing benchmarks. We hope this work advances reliable, professional-grade agentic image editing and its integration into real workflows.
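The localized, layer-based editing described in the abstract can be illustrated with a small crop, edit, and composite loop. The sketch below is a simplified assumption about how such a step might look, not the authors' method: `bounding_box`, `edit_layer`, `outside_mask_unchanged`, and the `brighten` editor are hypothetical names, images are plain nested lists of grayscale values, and the exact-match check is far stricter than the SSIM-OM and LPIPS-OM metrics reported above.

```python
def bounding_box(mask):
    """Return (top, left, bottom, right) of the True cells; bottom/right exclusive."""
    rows = [r for r, row in enumerate(mask) if any(row)]
    if not rows:
        raise ValueError("mask selects no pixels")
    cols = [c for c in range(len(mask[0])) if any(row[c] for row in mask)]
    return rows[0], cols[0], rows[-1] + 1, cols[-1] + 1


def edit_layer(image, mask, edit_fn):
    """Apply `edit_fn` to the masked crop only, then composite it back."""
    top, left, bottom, right = bounding_box(mask)
    crop = [row[left:right] for row in image[top:bottom]]
    edited = edit_fn(crop)  # stand-in for a generative editor
    out = [row[:] for row in image]  # full-resolution copy of the input
    for r in range(top, bottom):
        for c in range(left, right):
            if mask[r][c]:  # only masked pixels are overwritten
                out[r][c] = edited[r - top][c - left]
    return out


def outside_mask_unchanged(before, after, mask):
    """True if every pixel outside the mask is identical in both images."""
    return all(
        before[r][c] == after[r][c]
        for r in range(len(mask))
        for c in range(len(mask[0]))
        if not mask[r][c]
    )


# Toy 4x6 grayscale image with a 2x3 target region in the middle.
image = [[10 * (r + c) for c in range(6)] for r in range(4)]
mask = [[1 <= r <= 2 and 2 <= c <= 4 for c in range(6)] for r in range(4)]


def brighten(crop):
    return [[min(p + 50, 255) for p in row] for row in crop]


result = edit_layer(image, mask, brighten)
print(outside_mask_unchanged(image, result, mask))
```

Because the editor only ever sees the crop, its working resolution is decoupled from the size of the full image, and the composite step guarantees by construction that non-target pixels are copied through unchanged. Real pipelines would also need soft mask edges and blending, which this sketch omits.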