ImagerySearch: Adaptive Test-Time Search for Video Generation Beyond Semantic Dependency Constraints
Meiqi Wu, Jiashu Zhu, Xiaokun Feng, Chubin Chen, Chen Zhu, Bingze Song, Fangyuan Mao, Jiahong Wu, Xiangxiang Chu, Kaiqi Huang
2025-10-17
Summary
This paper focuses on improving how well AI models create videos from text descriptions, specifically when those descriptions ask for something imaginative or unusual.
What's the problem?
Current video generation models are really good at making videos of things they've 'seen' a lot during training, like everyday scenes. However, they struggle when asked to create videos of things that don't usually go together or are very abstract, because those scenarios weren't well represented in their training data. Existing methods try to improve video quality after the model generates it, but they aren't flexible enough to handle these complex, imaginative requests.
What's the solution?
The researchers developed a new technique called ImagerySearch. This method doesn't just make a video and then try to fix it; instead, it actively searches for the best way to create the video *while* it's being generated. It does this by measuring the semantic relationships between the different ideas in the text description and adjusting its approach accordingly. Essentially, it dynamically changes both how widely it searches over candidate videos and how it judges whether a candidate is good, based on what the prompt is asking for.
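To make the idea concrete, here is a minimal sketch of a prompt-guided adaptive best-of-N search. This is not the paper's actual implementation; every name here (`semantic_distance`, `adaptive_search`, the co-occurrence table, the toy reward terms) is a hypothetical stand-in, and the assumption is simply that a larger semantic distance between the prompt's concepts should widen the candidate pool and shift the reward toward cross-concept coherence.

```python
import random

def semantic_distance(concept_a, concept_b, cooccurrence):
    # Hypothetical stand-in: distance is high when two concepts rarely
    # co-occur. A real system would estimate this from embeddings or
    # corpus co-occurrence statistics.
    return 1.0 - cooccurrence.get(frozenset((concept_a, concept_b)), 0.0)

def adaptive_search(prompt_concepts, cooccurrence, generate, reward,
                    base_candidates=2, seed=0):
    """Best-of-N test-time search whose budget and reward weighting
    both scale with the semantic distance of the prompt's concept pair."""
    rng = random.Random(seed)
    dist = semantic_distance(*prompt_concepts, cooccurrence)

    # Widen the search space for long-distance (imaginative) prompts.
    n_candidates = base_candidates + int(4 * dist)

    # Shift the reward toward cross-concept coherence as distance grows.
    w_coherence = dist
    w_fidelity = 1.0 - dist

    # Sample candidates (here `generate` is any callable returning a clip
    # given the concepts and a noise value) and keep the best-scoring one.
    candidates = [generate(prompt_concepts, rng.random())
                  for _ in range(n_candidates)]
    scored = [(w_fidelity * reward(c, "fidelity")
               + w_coherence * reward(c, "coherence"), c)
              for c in candidates]
    return max(scored)[1], n_candidates
```

With a toy co-occurrence table, a familiar pair like ("dog", "park") gets a small budget, while a rare pair like ("whale", "desert") triggers a wider search and a coherence-weighted reward, mirroring the adaptive behavior described above.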
Why it matters?
This work is important because it pushes the boundaries of what AI can create. By making models better at handling imaginative prompts, we can unlock new possibilities for creative content generation, like personalized stories, unique art, and more. The researchers also created a new benchmark, LDT-Bench, to specifically test and measure progress in this area, which will help other researchers build even better models in the future.
Abstract
Video generation models have achieved remarkable progress, particularly excelling in realistic scenarios; however, their performance degrades notably in imaginative scenarios. These prompts often involve rarely co-occurring concepts with long-distance semantic relationships, falling outside training distributions. Existing methods typically apply test-time scaling to improve video quality, but their fixed search spaces and static reward designs limit adaptability to imaginative scenarios. To fill this gap, we propose ImagerySearch, a prompt-guided adaptive test-time search strategy that dynamically adjusts both the inference search space and the reward function according to semantic relationships in the prompt. This enables more coherent and visually plausible videos in challenging imaginative settings. To evaluate progress in this direction, we introduce LDT-Bench, the first dedicated benchmark for long-distance semantic prompts, consisting of 2,839 diverse concept pairs and an automated protocol for assessing creative generation capabilities. Extensive experiments show that ImagerySearch consistently outperforms strong video generation baselines and existing test-time scaling approaches on LDT-Bench, and achieves competitive improvements on VBench, demonstrating its effectiveness across diverse prompt types. We will release LDT-Bench and code to facilitate future research on imaginative video generation.