RAPO++: Cross-Stage Prompt Optimization for Text-to-Video Generation via Data Alignment and Test-Time Scaling
Bingjie Gao, Qianli Ma, Xiaoxue Wu, Shuai Yang, Guanzhou Lan, Haonan Zhao, Jiaxuan Chen, Qingyang Liu, Yu Qiao, Xinyuan Chen, Yaohui Wang, Li Niu
2025-10-27
Summary
This paper introduces a new method, RAPO++, to improve how text-to-video AI models understand and act on user instructions, ultimately creating better videos from text prompts.
What's the problem?
Current text-to-video AI models struggle because people often give short, unclear, or poorly worded instructions. These prompts don't match the kind of detailed descriptions the AI was trained on, leading to videos that don't quite capture what the user intended, especially when it comes to complex scenes or actions.
What's the solution?
RAPO++ tackles this in three stages. First, it expands the user's prompt with relevant details and rewrites it to resemble the prompts used during the AI's training. Second, it repeatedly refines the prompt while *generating* the video, using feedback on how well the video matches the desired scene, movement, and overall realism. Finally, it uses these improved prompt examples to teach the AI's 'prompt rewriter' to produce good prompts on its own, even before creating a video.
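The three-stage flow can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the relation graph, the identity "video generator", the length-based scorer, and the fixed rewrite step are all hypothetical stand-ins for the real retrieval, generation, and feedback components.

```python
def retrieve_modifiers(prompt, relation_graph):
    """Stage 1 (RAPO): look up semantically related modifiers in a relation graph.

    relation_graph is a toy dict mapping a keyword to modifier phrases.
    """
    return [m for key, mods in relation_graph.items() if key in prompt for m in mods]


def refine_prompt(prompt, relation_graph):
    """Stage 1 (RAPO): enrich the user prompt so it resembles training prompts."""
    mods = retrieve_modifiers(prompt, relation_graph)
    return prompt + (", " + ", ".join(mods) if mods else "")


def sspo_loop(prompt, generate, score, rounds=3):
    """Stage 2 (SSPO): closed-loop refinement driven by feedback on samples.

    `generate` stands in for the T2V model and `score` for the multi-source
    feedback (semantic alignment, temporal coherence, etc.); here the rewrite
    step just appends a fixed hint, whereas the real system uses an LLM.
    """
    best_prompt, best_score = prompt, score(generate(prompt))
    for _ in range(rounds):
        candidate = best_prompt + ", higher temporal coherence"  # toy rewrite
        s = score(generate(candidate))
        if s > best_score:  # keep the candidate only if feedback improves
            best_prompt, best_score = candidate, s
    return best_prompt, best_score
```

Stage 3 would then pair each original prompt with its SSPO-optimized version and fine-tune the rewriter LLM on those pairs, so future prompts can be improved in one pass without the iterative loop.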
Why does it matter?
RAPO++ is important because it significantly improves the quality of videos generated from text, without needing to change the core AI model itself. It's a flexible, cost-effective way to get better results from existing text-to-video technology, making it easier for anyone to create videos simply by describing what they want.
Abstract
Prompt design plays a crucial role in text-to-video (T2V) generation, yet user-provided prompts are often short, unstructured, and misaligned with training data, limiting the generative potential of diffusion-based T2V models. We present RAPO++, a cross-stage prompt optimization framework that unifies training-data-aligned refinement, test-time iterative scaling, and large language model (LLM) fine-tuning to substantially improve T2V generation without modifying the underlying generative backbone. In Stage 1, Retrieval-Augmented Prompt Optimization (RAPO) enriches user prompts with semantically relevant modifiers retrieved from a relation graph and refactors them to match training distributions, enhancing compositionality and multi-object fidelity. Stage 2 introduces Sample-Specific Prompt Optimization (SSPO), a closed-loop mechanism that iteratively refines prompts using multi-source feedback (semantic alignment, spatial fidelity, temporal coherence, and task-specific signals such as optical flow), yielding progressively improved video generation quality. Stage 3 leverages optimized prompt pairs from SSPO to fine-tune the rewriter LLM, internalizing task-specific optimization patterns and enabling efficient, high-quality prompt generation even before inference. Extensive experiments across five state-of-the-art T2V models and five benchmarks demonstrate that RAPO++ achieves significant gains in semantic alignment, compositional reasoning, temporal stability, and physical plausibility, outperforming existing methods by large margins. Our results highlight RAPO++ as a model-agnostic, cost-efficient, and scalable solution that sets a new standard for prompt optimization in T2V generation. The code is available at https://github.com/Vchitect/RAPO.