ContextGen: Contextual Layout Anchoring for Identity-Consistent Multi-Instance Generation

Ruihang Xu, Dewei Zhou, Fan Ma, Yi Yang

2025-10-15

Summary

This paper introduces a new system called ContextGen that improves how AI creates images with multiple specific objects in them, based on instructions about where those objects should be and what they should look like.

What's the problem?

Current AI image generators struggle when asked to create images with several distinct objects placed in specific locations: it is hard to precisely control the layout, and each object often fails to keep its intended appearance, especially when the image should contain multiple versions of the same type of object.

What's the solution?

The researchers developed ContextGen, which feeds a composite layout image into the generation context as an 'anchor' that locks objects into their desired positions, and adds a new attention mechanism that draws on reference images to keep each object's appearance consistent. They also built a new large-scale dataset, IMIG-100K, with detailed layout and identity annotations to train and evaluate the system.
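The paper only sketches how its Identity Consistency Attention works at a high level, so the snippet below is a toy illustration of one plausible ingredient: attention masking that restricts each instance's image tokens to the reference tokens of that same instance, so identities don't blend. The function names and the masking scheme here are assumptions for illustration, not the paper's actual implementation.

```python
import math

def softmax(xs):
    # Numerically stable softmax; -inf scores get exactly zero weight.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    s = sum(exps)
    return [e / s for e in exps]

def masked_attention(queries, keys, values, mask):
    # Scaled dot-product attention with a boolean attend-mask.
    # mask[i][j] is True when query token i may attend to key token j.
    d = len(keys[0])
    out = []
    for i, q in enumerate(queries):
        scores = [
            sum(a * b for a, b in zip(q, k)) / math.sqrt(d)
            if mask[i][j] else float("-inf")
            for j, k in enumerate(keys)
        ]
        w = softmax(scores)
        out.append([
            sum(w[j] * values[j][t] for j in range(len(values)))
            for t in range(len(values[0]))
        ])
    return out

def identity_mask(query_owner, key_owner):
    # Hypothetical identity mask: a token owned by instance i may only
    # attend to reference tokens owned by the same instance i.
    return [[qo == ko for ko in key_owner] for qo in query_owner]
```

With two instances and one reference token each, `identity_mask([0, 1], [0, 1])` yields `[[True, False], [False, True]]`, so each instance's output copies its own reference features exactly; without the mask, features from both references would blend and the instances would drift toward each other's identity.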

Why it matters?

This work is important because it pushes the boundaries of what AI can create visually. Better control over object placement and identity opens up possibilities for more realistic and customized image generation, which could be useful in fields like design, entertainment, and even creating training data for other AI systems.

Abstract

Multi-instance image generation (MIG) remains a significant challenge for modern diffusion models due to key limitations in achieving precise control over object layout and preserving the identity of multiple distinct subjects. To address these limitations, we introduce ContextGen, a novel Diffusion Transformer framework for multi-instance generation that is guided by both layout and reference images. Our approach integrates two key technical contributions: a Contextual Layout Anchoring (CLA) mechanism that incorporates the composite layout image into the generation context to robustly anchor the objects in their desired positions, and Identity Consistency Attention (ICA), an innovative attention mechanism that leverages contextual reference images to ensure the identity consistency of multiple instances. Recognizing the lack of large-scale, hierarchically-structured datasets for this task, we introduce IMIG-100K, the first dataset with detailed layout and identity annotations. Extensive experiments demonstrate that ContextGen sets a new state-of-the-art, outperforming existing methods in control precision, identity fidelity, and overall visual quality.