
Layer-Aware Video Composition via Split-then-Merge

Ozgur Kara, Yujia Chen, Ming-Hsuan Yang, James M. Rehg, Wen-Sheng Chu, Du Tran

2025-12-01


Summary

This paper introduces a new method called Split-then-Merge, or StM, for creating realistic videos by teaching a model how foreground objects move and interact with different backgrounds.

What's the problem?

Generating realistic videos is hard because it requires a lot of data showing how things move in the real world. Collecting this data, with everything labeled and categorized, is expensive and time-consuming. Existing methods either need large amounts of labeled data or rely on simple, manually defined rules, which fail to capture the complexity of real-world motion.

What's the solution?

StM tackles this by taking a large collection of *unlabeled* videos and breaking them down into foreground objects (like a person or a car) and backgrounds. Then, it teaches the model to composite these foregrounds into different backgrounds and learn how they should realistically interact. It uses a special training process that ensures the foreground object keeps its identity when blended into a new scene, and that the interactions make sense. Essentially, it learns by taking videos apart and then putting them back together in new ways.
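The two core ideas in that paragraph, compositing a foreground layer onto a new background and penalizing any drift in the foreground's appearance, can be sketched in a few lines of numpy. This is only an illustrative toy, not the paper's actual pipeline: the function names, the simple alpha blend, and the squared-error "identity" loss are all assumptions standing in for StM's learned fusion and identity-preservation loss.

```python
import numpy as np

def composite(fg, alpha, bg):
    """Alpha-blend a foreground layer onto a new background.
    fg, bg: (H, W, 3) float images; alpha: (H, W, 1) mask in [0, 1]."""
    return alpha * fg + (1.0 - alpha) * bg

def identity_preservation_loss(composed, fg, alpha):
    """Toy stand-in for an identity-preservation loss: mean squared
    deviation of the composed frame from the original foreground,
    measured only inside the foreground mask."""
    diff = (composed - fg) ** 2
    return float((alpha * diff).sum() / max(alpha.sum(), 1e-8))

# Toy example: a bright square "subject" pasted onto a dark new scene.
H = W = 4
fg = np.full((H, W, 3), 0.8)                 # foreground layer
bg = np.zeros((H, W, 3))                     # new background
alpha = np.zeros((H, W, 1))
alpha[1:3, 1:3] = 1.0                        # binary subject mask

frame = composite(fg, alpha, bg)
loss = identity_preservation_loss(frame, fg, alpha)
```

With a hard binary mask the blend copies the foreground exactly, so the loss is zero; in the real method the blend is produced by a generative model, and the loss pushes it to keep the subject recognizable while still adapting it to the new scene.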

Why it matters?

This research is important because it enables the creation of more realistic composite videos with far less painstakingly labeled data. This could be useful for creating special effects, generating training data for robots, or making more immersive virtual reality experiences. The results show the new method outperforms current techniques, both on quantitative benchmarks and in evaluations by humans and vision-language models.

Abstract

We present Split-then-Merge (StM), a novel framework designed to enhance control in generative video composition and address its data scarcity problem. Unlike conventional methods relying on annotated datasets or handcrafted rules, StM splits a large corpus of unlabeled videos into dynamic foreground and background layers, then self-composes them to learn how dynamic subjects interact with diverse scenes. This process enables the model to learn the complex compositional dynamics required for realistic video generation. StM introduces a novel transformation-aware training pipeline that utilizes a multi-layer fusion and augmentation to achieve affordance-aware composition, alongside an identity-preservation loss that maintains foreground fidelity during blending. Experiments show StM outperforms SoTA methods in both quantitative benchmarks and in humans/VLLM-based qualitative evaluations. More details are available at our project page: https://split-then-merge.github.io