ROICtrl: Boosting Instance Control for Visual Generation

Yuchao Gu, Yipin Zhou, Yunfan Ye, Yixin Nie, Licheng Yu, Pingchuan Ma, Kevin Qinghong Lin, Mike Zheng Shou

2024-11-27


Summary

This paper introduces ROICtrl, an adapter for pretrained diffusion models that lets users control where multiple objects appear in a generated image and how each one looks, using a bounding box and a caption for each object.

What's the problem?

Current text-based visual generation models struggle when a single prompt has to describe many objects. Natural language is poor at tying a position and a set of attributes to each object, so the models tend to render a few dominant objects and lose or mix up the details of the rest. This limits them to relatively simple compositions.

What's the solution?

ROICtrl adds regional instance control to existing diffusion models: each object is specified by a bounding box paired with a free-form caption describing it. The method combines two operations on the model's feature maps: ROI-Align, borrowed from object detection, which extracts the features inside each box, and a new complementary operation called ROI-Unpool, which places the processed region features back at their original positions. According to the authors, this handles regions explicitly and accurately at a lower computational cost than earlier approaches based on implicit position encoding or attention masks.
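To make the two operations concrete, here is a minimal pure-Python sketch on a single-channel 2D feature map. It is an illustration under simplifying assumptions, not the paper's implementation: the function names `bilinear`, `roi_align`, and `roi_unpool`, the one-sample-per-bin pooling, and the zero fill outside the box are all choices made for this example, and the actual ROI-Unpool in ROICtrl may differ.

```python
def bilinear(grid, x, y):
    """Bilinearly sample a 2D list at continuous (x, y).

    Assumed convention: cell (i, j) has its center at (j + 0.5, i + 0.5).
    Coordinates outside the grid are clamped to the border cells.
    """
    h, w = len(grid), len(grid[0])
    u = min(max(x - 0.5, 0.0), w - 1.0)
    v = min(max(y - 0.5, 0.0), h - 1.0)
    j0, i0 = int(u), int(v)
    j1, i1 = min(j0 + 1, w - 1), min(i0 + 1, h - 1)
    du, dv = u - j0, v - i0
    top = grid[i0][j0] * (1 - du) + grid[i0][j1] * du
    bottom = grid[i1][j0] * (1 - du) + grid[i1][j1] * du
    return top * (1 - dv) + bottom * dv


def roi_align(fmap, box, out_size):
    """Crop box = (x0, y0, x1, y1) from fmap into an out_size x out_size grid.

    Simplification: one bilinear sample at the center of each output bin.
    """
    x0, y0, x1, y1 = box
    bin_w = (x1 - x0) / out_size
    bin_h = (y1 - y0) / out_size
    return [
        [bilinear(fmap, x0 + (c + 0.5) * bin_w, y0 + (r + 0.5) * bin_h)
         for c in range(out_size)]
        for r in range(out_size)
    ]


def roi_unpool(roi_feat, box, height, width):
    """Paste a square ROI grid back into a height x width map at box.

    Cells whose center lies outside the box are left at zero (an assumption
    of this sketch).
    """
    x0, y0, x1, y1 = box
    size = len(roi_feat)
    out = [[0.0] * width for _ in range(height)]
    for i in range(height):
        for j in range(width):
            cx, cy = j + 0.5, i + 0.5
            if x0 <= cx <= x1 and y0 <= cy <= y1:
                # Map the cell center into the ROI grid's coordinate frame.
                rx = (cx - x0) / (x1 - x0) * size
                ry = (cy - y0) / (y1 - y0) * size
                out[i][j] = bilinear(roi_feat, rx, ry)
    return out
```

With a box whose bins line up with the pixel grid, for example `roi_align(fmap, (2.0, 2.0, 6.0, 6.0), 4)` on an 8x8 map, the crop reproduces the cells inside the box, and `roi_unpool` writes them back to the same positions while leaving everything outside the box at zero.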

Why does it matter?

This research extends visual generation models to more detailed, multi-object scenes laid out by the user. Because ROICtrl is an adapter, it works with community-finetuned diffusion models and with existing add-ons such as ControlNet and IP-Adapter. That could make it useful in fields like graphic design, gaming, and virtual reality, where creators need the generated image to match a specific layout.

Abstract

Natural language often struggles to accurately associate positional and attribute information with multiple instances, which limits current text-based visual generation models to simpler compositions featuring only a few dominant instances. To address this limitation, this work enhances diffusion models by introducing regional instance control, where each instance is governed by a bounding box paired with a free-form caption. Previous methods in this area typically rely on implicit position encoding or explicit attention masks to separate regions of interest (ROIs), resulting in either inaccurate coordinate injection or large computational overhead. Inspired by ROI-Align in object detection, we introduce a complementary operation called ROI-Unpool. Together, ROI-Align and ROI-Unpool enable explicit, efficient, and accurate ROI manipulation on high-resolution feature maps for visual generation. Building on ROI-Unpool, we propose ROICtrl, an adapter for pretrained diffusion models that enables precise regional instance control. ROICtrl is compatible with community-finetuned diffusion models, as well as with existing spatial-based add-ons (e.g., ControlNet, T2I-Adapter) and embedding-based add-ons (e.g., IP-Adapter, ED-LoRA), extending their applications to multi-instance generation. Experiments show that ROICtrl achieves superior performance in regional instance control while significantly reducing computational costs.
