Exploring MLLM-Diffusion Information Transfer with MetaCanvas

Han Lin, Xichen Pan, Ziqi Huang, Ji Hou, Jialiang Wang, Weifeng Chen, Zecheng He, Felix Juefei-Xu, Junzhe Sun, Zhipeng Fan, Ali Thabet, Mohit Bansal, Chu Wang

2025-12-15

Summary

This paper introduces MetaCanvas, a new framework for improving how AI creates images and videos from text or other images. It focuses on making better use of the reasoning abilities of advanced language models during the creation process.

What's the problem?

Current AI systems that generate images and videos often don't fully use the power of the large language models they're built on. These language models are great at understanding complex descriptions, but when it comes to actually *making* the image or video, they're typically reduced to passing along a single, general text summary. As a result, the generated content can lack the precise details and structured control that the model understood in the first place, creating a disconnect between understanding and creation.

What's the solution?

MetaCanvas is a framework that lets these powerful language models directly plan and control the image or video creation process within the 'latent space' where the image is being built. Instead of handing over just a general text description, the model can reason about the spatial arrangement of objects and how things change over time, and then directly steer the generation accordingly. The researchers implemented this on three different diffusion-based generation backbones and evaluated it across six tasks, including creating images from text, editing existing images and videos, and generating videos.

Why it matters?

This work is important because it shows a promising way to bridge the gap between an AI's ability to *understand* visual information and its ability to *create* it. By letting the AI actively plan the visual details, we can expect more accurate, detailed, and controllable image and video generation in the future.

Abstract

Multimodal learning has rapidly advanced visual understanding, largely via multimodal large language models (MLLMs) that use powerful LLMs as cognitive cores. In visual generation, however, these powerful core models are typically reduced to global text encoders for diffusion models, leaving most of their reasoning and planning ability unused. This creates a gap: current multimodal LLMs can parse complex layouts, attributes, and knowledge-intensive scenes, yet struggle to generate images or videos with equally precise and structured control. We propose MetaCanvas, a lightweight framework that lets MLLMs reason and plan directly in spatial and spatiotemporal latent spaces and interface tightly with diffusion generators. We empirically implement MetaCanvas on three different diffusion backbones and evaluate it across six tasks, including text-to-image generation, text/image-to-video generation, image/video editing, and in-context video generation, each requiring precise layouts, robust attribute binding, and reasoning-intensive control. MetaCanvas consistently outperforms global-conditioning baselines, suggesting that treating MLLMs as latent-space planners is a promising direction for narrowing the gap between multimodal understanding and generation.
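
To make the architectural idea concrete, here is a minimal, hypothetical PyTorch sketch of the kind of interface the abstract describes: rather than collapsing the MLLM output into one global conditioning vector, a planner head emits a grid of per-position plan tokens in the diffusion latent space, and a denoiser block cross-attends to that grid. All module names, dimensions, and the grid size are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch of "MLLM as latent-space planner" conditioning a
# diffusion denoiser. Sizes and module names are illustrative assumptions.
import torch
import torch.nn as nn

class LatentPlanner(nn.Module):
    """Stand-in for an MLLM head that emits an H x W grid of plan tokens."""
    def __init__(self, llm_dim=512, plan_dim=256, grid=8):
        super().__init__()
        # One learned query per spatial cell of the diffusion latent.
        self.queries = nn.Parameter(torch.randn(grid * grid, plan_dim))
        self.attn = nn.MultiheadAttention(plan_dim, num_heads=4,
                                          kdim=llm_dim, vdim=llm_dim,
                                          batch_first=True)

    def forward(self, llm_hidden):           # llm_hidden: (B, T, llm_dim)
        B = llm_hidden.size(0)
        q = self.queries.unsqueeze(0).expand(B, -1, -1)
        plan, _ = self.attn(q, llm_hidden, llm_hidden)
        return plan                           # (B, grid*grid, plan_dim)

class DenoiserBlock(nn.Module):
    """One denoiser block whose image latents cross-attend to the plan grid."""
    def __init__(self, latent_dim=256, plan_dim=256):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(latent_dim, 4, batch_first=True)
        self.cross_attn = nn.MultiheadAttention(latent_dim, 4, kdim=plan_dim,
                                                vdim=plan_dim, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(latent_dim, 4 * latent_dim),
                                 nn.GELU(),
                                 nn.Linear(4 * latent_dim, latent_dim))

    def forward(self, x, plan):               # x: (B, H*W, latent_dim)
        x = x + self.self_attn(x, x, x)[0]
        x = x + self.cross_attn(x, plan, plan)[0]  # spatially structured conditioning
        return x + self.mlp(x)

if __name__ == "__main__":
    B, T, HW = 2, 32, 64                      # batch, prompt tokens, 8x8 latent grid
    llm_hidden = torch.randn(B, T, 512)       # would come from a frozen MLLM
    noisy_latents = torch.randn(B, HW, 256)   # would come from the diffusion VAE
    plan = LatentPlanner()(llm_hidden)
    out = DenoiserBlock()(noisy_latents, plan)
    print(out.shape)                          # torch.Size([2, 64, 256])
```

Running the script only shape-checks the flow: the planner's 8x8 grid matches the 8x8 image latent, so each latent position can attend to a spatially aligned plan token instead of a single pooled text embedding, which is the contrast the abstract draws with global-conditioning baselines.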