VISTA: A Test-Time Self-Improving Video Generation Agent
Do Xuan Long, Xingchen Wan, Hootan Nakhost, Chen-Yu Lee, Tomas Pfister, Sercan Ö. Arık
2025-10-20
Summary
This paper introduces a new system called VISTA that automatically improves the quality of videos created from text descriptions, making them more closely match what the user intended.
What's the problem?
Creating good videos from text is hard because the results are highly sensitive to the exact wording of the prompt. Existing methods that refine prompts at test time, although successful in other domains, fall short for video, likely because videos combine many aspects (visuals, sound, and how things change over time) that all need to be considered at once. That makes video a much more complex problem than, say, improving images.
What's the solution?
VISTA works by breaking the user's idea down into a structured, step-by-step plan. It then generates several candidate videos, picks the best one through head-to-head (pairwise) comparisons, and has three 'agent' programs critique the winner: one focuses on how the video looks, another on how it sounds, and a third on whether the video makes sense overall. A final 'reasoning' agent combines this feedback and rewrites the original prompt to be more specific and detailed. The process repeats, producing better videos with each cycle.
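The cycle described above can be sketched in code. This is a minimal illustration only: the function and class names (`vista_loop`, `pairwise_tournament`, `Video`, the toy scoring) are assumptions made for this sketch, not the paper's actual implementation, and the generator, judge, critics, and rewriter are stand-in callables the real system would replace with model calls.

```python
# Illustrative sketch of an iterative prompt-refinement loop in the style of
# VISTA. All names here are hypothetical; the real system uses video models
# and LLM agents where this sketch uses simple placeholder functions.
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Video:
    prompt: str   # the prompt this video was generated from
    score: float  # stand-in for a judge's preference signal


def pairwise_tournament(videos: List[Video],
                        better: Callable[[Video, Video], Video]) -> Video:
    """Single-elimination tournament: compare videos in pairs, keep winners."""
    pool = list(videos)
    while len(pool) > 1:
        nxt = [better(pool[i], pool[i + 1]) for i in range(0, len(pool) - 1, 2)]
        if len(pool) % 2 == 1:          # odd one out gets a bye
            nxt.append(pool[-1])
        pool = nxt
    return pool[0]


def vista_loop(idea: str,
               generate: Callable[[str], Video],
               better: Callable[[Video, Video], Video],
               critics: List[Callable[[Video], str]],
               rewrite: Callable[[str, List[str]], str],
               iterations: int = 3,
               samples: int = 4) -> Video:
    """Generate candidates, pick a winner, critique it, rewrite the prompt."""
    prompt = idea
    best = None
    for _ in range(iterations):
        candidates = [generate(prompt) for _ in range(samples)]
        best = pairwise_tournament(candidates, better)
        feedback = [critic(best) for critic in critics]  # visual/audio/context
        prompt = rewrite(prompt, feedback)
    return best


# Toy demo: longer (more detailed) prompts score higher, so each cycle
# should yield a higher-scoring best video.
def generate(prompt: str) -> Video:
    return Video(prompt, score=len(prompt) * 0.01)

def better(a: Video, b: Video) -> Video:
    return a if a.score >= b.score else b

critics = [lambda v: "more detail"] * 3  # three placeholder critique agents

def rewrite(prompt: str, feedback: List[str]) -> str:
    return prompt + " " + " ".join(feedback)

best = vista_loop("a cat surfing", generate, better, critics, rewrite)
```

In the toy demo the rewritten prompt grows each cycle, so the final winner's score strictly improves over the first iteration, which is the qualitative behavior the loop is meant to capture.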
Why it matters?
This research is important because it makes text-to-video generation much more reliable and user-friendly. Instead of needing to be a prompt expert to get a good video, users can let VISTA automatically refine the instructions, producing higher-quality videos that better reflect their vision. Tests show VISTA consistently outperforms other methods, and human evaluators preferred its videos in 66.4% of comparisons.
Abstract
Despite rapid advances in text-to-video synthesis, generated video quality remains critically dependent on precise user prompts. Existing test-time optimization methods, successful in other domains, struggle with the multi-faceted nature of video. In this work, we introduce VISTA (Video Iterative Self-improvemenT Agent), a novel multi-agent system that autonomously improves video generation through refining prompts in an iterative loop. VISTA first decomposes a user idea into a structured temporal plan. After generation, the best video is identified through a robust pairwise tournament. This winning video is then critiqued by a trio of specialized agents focusing on visual, audio, and contextual fidelity. Finally, a reasoning agent synthesizes this feedback to introspectively rewrite and enhance the prompt for the next generation cycle. Experiments on single- and multi-scene video generation scenarios show that while prior methods yield inconsistent gains, VISTA consistently improves video quality and alignment with user intent, achieving up to 60% pairwise win rate against state-of-the-art baselines. Human evaluators concur, preferring VISTA outputs in 66.4% of comparisons.