iMontage: Unified, Versatile, Highly Dynamic Many-to-many Image Generation
Zhoujie Fu, Xianfang Zeng, Jinghong Lan, Xinyao Liao, Cheng Chen, Junyi Chen, Jiacheng Wei, Wei Cheng, Shiyu Liu, Yunuo Chen, Gang Yu, Guosheng Lin
2025-11-26
Summary
This paper introduces iMontage, a system that repurposes an existing video-generation AI to create far more diverse and dynamic sets of images than was previously possible.
What's the problem?
Current AI models that generate videos are really good at making motion look natural, but they often lack variety in what they can actually *show*. They're limited by the kinds of videos they were trained on: essentially, they're good at smooth transitions, but not at covering a huge range of different scenes and actions.
What's the solution?
The researchers took a powerful video AI and adapted it to consume and produce sets of still images instead of video frames. They did this with a minimally invasive change to how the model processes its inputs, then trained it on a large, carefully curated collection of diverse images. This lets the model create image sequences that keep the natural-looking transitions of video while covering a much wider variety of content, effectively turning a video generator into a versatile many-in, many-out image generator.
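To make the "many-in, many-out" idea concrete, here is a minimal sketch of one plausible way to feed a variable-length image set through a video backbone: condition images and noise placeholders are packed along the model's temporal axis so its temporal attention relates every image to every other. The names `VideoBackbone`-style interface, `ManyToManyWrapper`, and `frame_index` are illustrative assumptions, not the paper's actual API.

```python
import torch

class ManyToManyWrapper(torch.nn.Module):
    """Hypothetical adapter: treats an image set as pseudo-video frames."""

    def __init__(self, backbone: torch.nn.Module):
        super().__init__()
        # Pre-trained video model, kept largely intact to preserve its
        # motion priors (assumed signature: backbone(frames, frame_index)).
        self.backbone = backbone

    def forward(self, cond_latents: torch.Tensor, num_outputs: int) -> torch.Tensor:
        # cond_latents: (N_in, C, H, W) latents of the input image set.
        n_in, c, h, w = cond_latents.shape
        noise = torch.randn(num_outputs, c, h, w, device=cond_latents.device)
        # Stack inputs and noisy targets along the "time" axis so the video
        # model's temporal attention can condition outputs on every input.
        frames = torch.cat([cond_latents, noise], dim=0)  # (N_in+N_out, C, H, W)
        frame_index = torch.arange(frames.shape[0], device=frames.device)
        denoised = self.backbone(frames.unsqueeze(0), frame_index)  # (1, N, C, H, W)
        # Return only the newly generated images.
        return denoised[0, n_in:]
```

Because both the number of inputs and the number of outputs are just lengths along the pseudo-temporal axis, a single interface like this could, under these assumptions, cover editing (one in, one out), story generation (one in, many out), and set-to-set tasks without changing the backbone.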
Why it matters?
This is important because it opens up possibilities for creating more interesting and complex images for things like art, design, and even special effects. iMontage can generate scenes with a level of detail and dynamism that wasn't achievable before, and it does so without losing the realistic motion qualities of the original video AI.
Abstract
Pre-trained video models learn powerful priors for generating high-quality, temporally coherent content. While these models excel at temporal coherence, their dynamics are often constrained by the continuous nature of their training data. We hypothesize that by injecting the rich and unconstrained content diversity from image data into this coherent temporal framework, we can generate image sets that feature both natural transitions and a far more expansive dynamic range. To this end, we introduce iMontage, a unified framework designed to repurpose a powerful video model into an all-in-one image generator. The framework consumes and produces variable-length image sets, unifying a wide array of image generation and editing tasks. To achieve this, we propose an elegant and minimally invasive adaptation strategy, complemented by a tailored data curation process and training paradigm. This approach allows the model to acquire broad image manipulation capabilities without corrupting its invaluable original motion priors. iMontage excels across several mainstream many-in-many-out tasks, not only maintaining strong cross-image contextual consistency but also generating scenes with extraordinary dynamics that surpass conventional scopes. Find our homepage at: https://kr1sjfu.github.io/iMontage-web/.