Diverse Video Generation with Determinantal Point Process-Guided Policy Optimization
Tahira Kazimi, Connor Dunlop, Pinar Yanardag
2025-11-26
Summary
This paper focuses on improving the variety of videos generated by artificial intelligence from text descriptions. Current AI models can create high-quality videos, but when asked to generate multiple videos from the same prompt, they often produce very similar results, lacking diversity.
What's the problem?
When you ask an AI to make multiple videos from the same text prompt, it tends to give you variations of the *same* video, instead of exploring the many different possibilities the prompt allows. Imagine asking for a video of 'a cat playing in a garden' and getting ten videos that all show a very similar cat, in a very similar garden, doing almost the same thing. The problem is a lack of diversity in the generated outputs.
What's the solution?
The researchers developed a new technique called DPP-GRPO. Think of it like training the AI to *want* to create different videos. It does this by giving the AI a 'reward' for generating videos that are different from each other, but still accurately reflect the original text prompt. It uses two mathematical ideas, Determinantal Point Processes and Group Relative Policy Optimization, to encourage this diversity by penalizing repetitive results and providing feedback on sets of video options. Importantly, this technique can be added to existing video generation models without major changes.
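The 'diminishing returns on repetitive results' idea can be illustrated with a small sketch. This is not the authors' implementation; it assumes each generated video is summarized by a feature vector and scores a candidate set by the log-determinant of a similarity kernel, the core quantity behind Determinantal Point Processes: near-duplicate samples make the kernel nearly singular, so redundancy drives the score down.

```python
import numpy as np

def dpp_diversity_reward(embeddings, sigma=1.0):
    """Score a set of samples by the log-determinant of an RBF similarity kernel.

    Near-duplicate samples make kernel rows nearly linearly dependent,
    driving the determinant toward zero -- the 'diminishing returns'
    property of a DPP. (Illustrative sketch, not the paper's exact reward.)
    """
    X = np.asarray(embeddings, dtype=float)
    sq = np.sum(X**2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2 * X @ X.T   # pairwise squared distances
    L = np.exp(-d2 / (2 * sigma**2))               # RBF kernel, ones on the diagonal
    # Small jitter keeps the log-determinant finite for duplicate rows.
    _, logdet = np.linalg.slogdet(L + 1e-6 * np.eye(len(X)))
    return logdet

# A spread-out set scores higher than a near-duplicate set.
diverse   = [[0.0, 0.0], [3.0, 0.0], [0.0, 3.0]]
redundant = [[0.0, 0.0], [0.01, 0.0], [0.0, 0.01]]
assert dpp_diversity_reward(diverse) > dpp_diversity_reward(redundant)
```

In practice the feature vectors would come from a video encoder; the key design choice is that the reward is assigned to the *set*, so adding a sample similar to one already in the set barely raises the score.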
Why it matters?
This work is important because it makes AI-generated videos more useful and interesting. More diverse outputs mean you're more likely to get a video that truly matches your vision, and it opens up possibilities for creative applications where variety is key. The researchers also shared their code and a new dataset to help other researchers build on their work, accelerating progress in the field.
Abstract
While recent text-to-video (T2V) diffusion models have achieved impressive quality and prompt alignment, they often produce low-diversity outputs when sampling multiple videos from a single text prompt. We tackle this challenge by formulating it as a set-level policy optimization problem, with the goal of training a policy that covers the diverse range of plausible outcomes for a given prompt. To address this, we introduce DPP-GRPO, a novel framework for diverse video generation that combines Determinantal Point Processes (DPPs) and Group Relative Policy Optimization (GRPO) to enforce an explicit reward for diverse generations. Our objective turns diversity into an explicit signal by imposing diminishing returns on redundant samples (via DPP) while supplying groupwise feedback over candidate sets (via GRPO). Our framework is plug-and-play and model-agnostic, and encourages diverse generations across visual appearance, camera motion, and scene structure without sacrificing prompt fidelity or perceptual quality. We implement our method on WAN and CogVideoX, and show that it consistently improves video diversity on state-of-the-art benchmarks such as VBench and VideoScore, as well as in human preference studies. Moreover, we release our code and a new benchmark dataset of 30,000 diverse prompts to support future research.
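The 'groupwise feedback over candidate sets' mentioned above can be sketched as a group-relative advantage computation in the style of GRPO: each sample's reward is normalized by the statistics of the candidate group drawn for the same prompt, so feedback is relative rather than absolute. This is an illustrative sketch of the general GRPO idea, not the paper's exact objective.

```python
import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize each sample's reward against its
    own candidate group, so a sample is rewarded for beating its siblings
    drawn from the same prompt. (Sketch; the paper's objective may differ.)
    """
    r = np.asarray(rewards, dtype=float)
    return (r - r.mean()) / (r.std() + eps)

# Above-group-mean samples get positive advantages, below-mean negative.
adv = group_relative_advantages([1.0, 2.0, 3.0])
```

Combined with a set-level diversity reward, this means a redundant sample scores below its group's mean and receives a negative advantage, which is exactly the pressure that pushes the policy toward varied outputs.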