DreaMontage: Arbitrary Frame-Guided One-Shot Video Generation

Jiawei Liu, Junqiao Li, Jiangfan Deng, Gen Li, Siyu Zhou, Zetao Fang, Shanshan Lao, Zengde Deng, Jianing Zhu, Tingting Ma, Jiayi Li, Yunqiu Wang, Qian He, Xinglong Wu

2025-12-25

Summary

This paper introduces DreaMontage, an AI system that generates continuous 'one-shot' videos: the long takes that look as if they were filmed in a single, unbroken shot.

What's the problem?

Shooting a true one-shot video in the real world is expensive and logistically difficult. AI video generators offer a virtual alternative, but naively stitching generated clips together looks choppy and unnatural, lacking the seamless flow of a genuine one-shot take.

What's the solution?

The researchers build on a Diffusion Transformer (DiT) video model and improve it in three ways. First, they add a lightweight intermediate-conditioning mechanism, trained with an Adaptive Tuning strategy, so the model can be guided by user-provided frames placed at arbitrary positions in the video (a minimal sketch of this conditioning idea follows below). Second, they fine-tune on a curated high-quality dataset (a Visual Expression SFT stage) and apply a tailored DPO scheme to make subject motion more plausible and scene transitions smoother. Finally, they design a Segment-wise Auto-Regressive (SAR) inference strategy that generates long videos segment by segment without running out of memory.
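
To make the arbitrary-frame conditioning idea concrete, here is a minimal sketch, assuming a typical masked-conditioning setup for video DiTs; the function name, tensor layout, and channel-concatenation step are illustrative assumptions, not DreaMontage's actual interface.

```python
import torch

def build_frame_conditioning(guide_frames, num_frames, frame_shape):
    """Pack user-provided guidance frames (keyed by frame index) into a dense
    conditioning tensor plus a binary mask: a common recipe for feeding
    arbitrary-frame constraints to a video diffusion transformer."""
    c, h, w = frame_shape
    cond = torch.zeros(num_frames, c, h, w)   # guidance pixels, zero elsewhere
    mask = torch.zeros(num_frames, 1, h, w)   # 1.0 where a frame is pinned
    for idx, frame in guide_frames.items():
        cond[idx] = frame
        mask[idx] = 1.0
    return cond, mask

# Example: pin the first, a middle, and the last frame of a 49-frame clip.
guides = {0: torch.randn(3, 64, 64),
          24: torch.randn(3, 64, 64),
          48: torch.randn(3, 64, 64)}
cond, mask = build_frame_conditioning(guides, num_frames=49,
                                      frame_shape=(3, 64, 64))
# A DiT could then consume torch.cat([noisy_latents, cond, mask], dim=1)
# per frame, denoising freely wherever the mask is zero.
```

With this layout, frames the user pins stay fixed while the model fills in everything between them, which is what lets fragmented inputs become one continuous shot.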

Why it matters?

DreaMontage makes it possible for anyone to create professional-looking, cinematic one-shot videos without the huge costs and logistical challenges of traditional filmmaking. This opens up creative possibilities for filmmakers, artists, and even everyday users who want to tell stories in a visually dynamic way, turning fragmented footage into a cohesive and captivating experience.

Abstract

The "one-shot" technique represents a distinct and sophisticated aesthetic in filmmaking. However, its practical realization is often hindered by prohibitive costs and complex real-world constraints. Although emerging video generation models offer a virtual alternative, existing approaches typically rely on naive clip concatenation, which frequently fails to maintain visual smoothness and temporal coherence. In this paper, we introduce DreaMontage, a comprehensive framework designed for arbitrary frame-guided generation, capable of synthesizing seamless, expressive, and long-duration one-shot videos from diverse user-provided inputs. To achieve this, we address the challenge through three primary dimensions. (i) We integrate a lightweight intermediate-conditioning mechanism into the DiT architecture. By employing an Adaptive Tuning strategy that effectively leverages base training data, we unlock robust arbitrary-frame control capabilities. (ii) To enhance visual fidelity and cinematic expressiveness, we curate a high-quality dataset and implement a Visual Expression SFT stage. In addressing critical issues such as subject motion rationality and transition smoothness, we apply a Tailored DPO scheme, which significantly improves the success rate and usability of the generated content. (iii) To facilitate the production of extended sequences, we design a Segment-wise Auto-Regressive (SAR) inference strategy that operates in a memory-efficient manner. Extensive experiments demonstrate that our approach achieves visually striking and seamlessly coherent one-shot effects while maintaining computational efficiency, empowering users to transform fragmented visual materials into vivid, cohesive one-shot cinematic experiences.