Canvas-to-Image: Compositional Image Generation with Multimodal Controls

Yusuf Dalva, Guocheng Gordon Qian, Maya Goldenberg, Tsai-Shien Chen, Kfir Aberman, Sergey Tulyakov, Pinar Yanardag, Kuan-Chieh Jackson Wang

2025-11-28

Summary

This paper introduces a new way to control AI image generation, letting users combine many precise instructions at once instead of relying on a single text prompt.

What's the problem?

Current AI image generators are really good at making pictures, but they struggle when you try to give them lots of specific directions at once, like 'put this person doing this pose in this location with these other objects'. It's hard to get the AI to follow all the instructions accurately and create a cohesive image when you're mixing text, references, and spatial arrangements.

What's the solution?

The researchers created a system called Canvas-to-Image. Think of it like giving the AI a sketch or a simple visual plan. You combine all your instructions – what you want to see, where things should be, poses, etc. – into a single image 'canvas'. The AI then interprets this canvas to create the final, detailed image. They also trained the AI on a lot of different examples to help it understand how to read and use these canvases effectively, teaching it to handle multiple instructions at the same time.
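To make the canvas idea concrete, here is a minimal sketch of how heterogeneous controls (subject reference crops, layout boxes, and pose constraints) could be composited into one image for a model to read. This is an illustration only, not the paper's actual encoding; the function name, parameters, and drawing conventions are all assumptions.

```python
from PIL import Image, ImageDraw

def build_canvas(size, subject_refs, layout_boxes, pose_segments):
    """Composite heterogeneous control signals into one canvas image.

    Hypothetical sketch -- the paper's real canvas format is not
    specified here. All argument names and conventions are assumed.
    """
    canvas = Image.new("RGB", size, "white")
    draw = ImageDraw.Draw(canvas)

    # 1. Paste subject reference crops at their target locations.
    for ref_img, (x, y, w, h) in subject_refs:
        canvas.paste(ref_img.resize((w, h)), (x, y))

    # 2. Draw layout annotations as labeled bounding boxes.
    for label, (x, y, w, h) in layout_boxes:
        draw.rectangle([x, y, x + w, y + h], outline="red", width=2)
        draw.text((x + 4, y + 4), label, fill="red")

    # 3. Overlay pose constraints as a simple line skeleton.
    for (x0, y0), (x1, y1) in pose_segments:
        draw.line([x0, y0, x1, y1], fill="blue", width=3)

    return canvas

# Stand-in for a subject reference photo.
ref = Image.new("RGB", (64, 64), "gray")
canvas = build_canvas(
    size=(512, 512),
    subject_refs=[(ref, (100, 150, 128, 128))],
    layout_boxes=[("person", (90, 140, 150, 200))],
    pose_segments=[((160, 160), (160, 240)), ((160, 240), (130, 300))],
)
```

The key design point this sketch mirrors is that all control modalities end up in one ordinary RGB image, so the diffusion model only ever needs a single extra conditioning input rather than a separate pathway per control type.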

Why does it matter?

This is important because it makes AI image generation much more useful and controllable. Instead of needing to carefully phrase text prompts or use complicated workarounds, users can visually communicate their ideas directly to the AI, resulting in images that more closely match their vision. It's a step towards making AI a truly collaborative tool for creative tasks.

Abstract

While modern diffusion models excel at generating high-quality and diverse images, they still struggle with high-fidelity compositional and multimodal control, particularly when users simultaneously specify text prompts, subject references, spatial arrangements, pose constraints, and layout annotations. We introduce Canvas-to-Image, a unified framework that consolidates these heterogeneous controls into a single canvas interface, enabling users to generate images that faithfully reflect their intent. Our key idea is to encode diverse control signals into a single composite canvas image that the model can directly interpret for integrated visual-spatial reasoning. We further curate a suite of multi-task datasets and propose a Multi-Task Canvas Training strategy that optimizes the diffusion model to jointly understand and integrate heterogeneous controls into text-to-image generation within a unified learning paradigm. This joint training enables Canvas-to-Image to reason across multiple control modalities rather than relying on task-specific heuristics, and it generalizes well to multi-control scenarios during inference. Extensive experiments show that Canvas-to-Image significantly outperforms state-of-the-art methods in identity preservation and control adherence across challenging benchmarks, including multi-person composition, pose-controlled composition, layout-constrained generation, and multi-control generation.