OmniCreator: Self-Supervised Unified Generation with Universal Editing
Haodong Chen, Lan Wang, Harry Yang, Ser-Nam Lim
2024-12-04

Summary
This paper introduces OmniCreator, a new framework that allows users to generate and edit both images and videos using text prompts, all in one system.
What's the problem?
Creating and editing videos and images typically requires separate tools and workflows. Existing methods usually focus on either generating content or editing it, and the editing methods are often limited to specific edit types or depend on extra controls such as structural conditions or inversion steps. This makes it difficult for users who want a seamless way to produce high-quality multimedia content from simple text instructions.
What's the solution?
OmniCreator addresses this by handling generation and editing of images and videos in a single framework. It is trained in a self-supervised way: the original text-video pair serves as the condition while the same video serves as the denoising target, so the model learns the semantic correspondence between text and video without extensive labeled editing data. At inference, given a text prompt together with a video, OmniCreator produces a result that is faithful to both, which amounts to universal editing; given only a text prompt, it generates high-quality videos or images from scratch. Because the editing effect comes from the learned correspondence rather than task-specific controls, it is not restricted to particular editing types. A sketch of this training setup appears below.
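The following is a minimal, illustrative sketch of the self-supervised objective described above: the original text-video pair conditions the model while the same video is the denoising target. All module names, signatures, and the scheduler API here are assumptions for illustration, not the authors' actual implementation.

```python
import torch
import torch.nn.functional as F

def training_step(denoiser, text_encoder, video_encoder, noise_scheduler,
                  video, caption):
    # Encode the conditioning signals (hypothetical encoder modules).
    text_cond = text_encoder(caption)
    video_cond = video_encoder(video)

    # Standard diffusion-style corruption of the *same* video that is
    # also used as the condition (assumed scheduler interface).
    noise = torch.randn_like(video)
    t = torch.randint(0, noise_scheduler.num_timesteps,
                      (video.shape[0],), device=video.device)
    noisy_video = noise_scheduler.add_noise(video, noise, t)

    # The denoiser sees both conditions; reconstructing the video forces it
    # to learn the text-video semantic correspondence.
    pred = denoiser(noisy_video, t, text_cond, video_cond)
    return F.mse_loss(pred, noise)
```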
Why it matters?
This research is important because it simplifies the process of creating and editing multimedia content, making advanced tools more accessible to users. By allowing for both image and video generation from text prompts, OmniCreator can enhance creativity in fields like filmmaking, advertising, and social media, enabling users to produce professional-quality content more easily.
Abstract
We introduce OmniCreator, a novel framework that can conduct text-prompted unified (image+video) generation as well as editing all in one place. OmniCreator acquires generative and universal editing capabilities in a self-supervised manner, taking original text-video pairs as conditions while utilizing the same video as a denoising target to learn the semantic correspondence between video and text. During inference, when presented with a text prompt and a video, OmniCreator is capable of generating a target that is faithful to both, achieving a universal editing effect that is unconstrained as opposed to existing editing work that primarily focuses on certain editing types or relies on additional controls (e.g., structural conditions, attention features, or DDIM inversion). On the other hand, when presented with a text prompt only, OmniCreator becomes generative, producing high-quality video as a result of the semantic correspondence learned. Importantly, we found that the same capabilities extend to images as is, making OmniCreator a truly unified framework. Further, due to the lack of existing generative video editing benchmarks, we introduce the OmniBench-99 dataset, designed to evaluate the performance of generative video editing models comprehensively. Extensive experiments demonstrate that OmniCreator exhibits substantial superiority over all other models.
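To make the two inference modes described in the abstract concrete, here is a hedged sketch of how a caller might use such a model; the `sample` method and its arguments are hypothetical placeholders, not an API defined by the paper.

```python
def edit_or_generate(model, prompt, video=None):
    # Hypothetical interface: with a reference video, the output stays
    # faithful to both the prompt and the video (universal editing);
    # with a prompt alone, the model generates content from scratch.
    if video is not None:
        return model.sample(text=prompt, reference_video=video)
    return model.sample(text=prompt)
```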