Towards Universal Video Retrieval: Generalizing Video Embedding via Synthesized Multimodal Pyramid Curriculum
Zhuoning Guo, Mingxin Li, Yanzhao Zhang, Dingkun Long, Pengjun Xie, Xiaowen Chu
2025-11-04
Summary
This paper addresses a major issue in how computers 'search' for videos: current methods are too focused on doing well on specific, limited tests, which prevents them from becoming truly good at understanding and retrieving videos in a general sense.
What's the problem?
The way we currently evaluate video search systems encourages developers to create models that only excel at a few very specific tasks. Because the tests are narrow, the models don't learn to handle the wide variety of videos and search requests that exist in the real world. There is currently no good way to test whether a video search system can *really* understand videos and find what you're looking for, even when the request is complex or unusual.
What's the solution?
The researchers created a framework that tackles this problem by improving evaluation, data collection, and model design together. First, they built a large and diverse suite of tests, the Universal Video Retrieval Benchmark, to thoroughly assess a model's abilities. They then used this benchmark's diagnostics to guide the synthesis of a huge dataset of videos and search queries. Finally, they developed a new model, the General Video Embedder, and trained it with a curriculum called the Modality Pyramid, which helps it learn the connections between different types of videos and searches. The resulting model is designed to perform well on a wide range of tasks, even ones it hasn't seen before.
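At inference time, an embedding model like the one described here reduces retrieval to comparing vectors: videos and queries are mapped into a shared space, and videos are ranked by similarity to the query. The paper's actual model and training details are far more involved; the snippet below is only a minimal sketch of that scoring step, using NumPy and hypothetical names (`retrieve`, `embed_normalize`), with random vectors standing in for real embeddings.

```python
import numpy as np


def embed_normalize(x):
    """L2-normalize embedding vectors along the last axis."""
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def retrieve(query_emb, video_embs, top_k=3):
    """Rank videos by cosine similarity between query and video embeddings."""
    q = embed_normalize(query_emb)
    v = embed_normalize(video_embs)
    scores = v @ q                      # cosine similarities, shape (num_videos,)
    order = np.argsort(-scores)[:top_k]  # indices of the top-k most similar videos
    return order, scores[order]


# Toy example: 4 stand-in "video" embeddings and one "query" embedding
# that is deliberately close to video 2, so video 2 should rank first.
rng = np.random.default_rng(0)
videos = rng.normal(size=(4, 8))
query = videos[2] + 0.01 * rng.normal(size=8)
idx, sims = retrieve(query, videos)
```

In a real system the embeddings would come from the trained encoder rather than a random generator, and the dot-product ranking would typically be served by an approximate nearest-neighbor index rather than a full scan.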
Why it matters?
This work is important because it provides a path towards creating video search systems that are much more versatile and useful. By moving beyond narrow benchmarks and focusing on general understanding, we can build systems that can truly help people find the videos they need, regardless of how they describe what they're looking for. The research also highlights that current tests aren't very good at predicting how well a system will perform in the real world, and that finding *partially* relevant videos is a common and important scenario that needs to be addressed.
Abstract
The prevailing video retrieval paradigm is structurally misaligned, as narrow benchmarks incentivize correspondingly limited data and single-task training. Therefore, universal capability is suppressed due to the absence of a diagnostic evaluation that defines and demands multi-dimensional generalization. To break this cycle, we introduce a framework built on the co-design of evaluation, data, and modeling. First, we establish the Universal Video Retrieval Benchmark (UVRB), a suite of 16 datasets designed not only to measure performance but also to diagnose critical capability gaps across tasks and domains. Second, guided by UVRB's diagnostics, we introduce a scalable synthesis workflow that generates 1.55 million high-quality pairs to populate the semantic space required for universality. Finally, we devise the Modality Pyramid, a curriculum that trains our General Video Embedder (GVE) by explicitly leveraging the latent interconnections within our diverse data. Extensive experiments show GVE achieves state-of-the-art zero-shot generalization on UVRB. In particular, our analysis reveals that popular benchmarks are poor predictors of general ability and that partially relevant retrieval is a dominant but overlooked scenario. Overall, our co-designed framework provides a practical path to escape the limited scope and advance toward truly universal video retrieval.