Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval

Dohwan Ko, Ji Soo Lee, Minhyuk Choi, Zihang Meng, Hyunwoo J. Kim

2025-08-06

Summary

This paper presents a new approach to text-video retrieval called bidirectional likelihood estimation with multi-modal large language models, which scores both how likely a video is given a text query and how likely the text is given the video.

What's the problem?

The problem is that current retrieval methods often favor popular or frequently occurring candidates over the ones most relevant to the query, which introduces bias and degrades search results.

What's the solution?

The paper introduces a framework that trains a multi-modal large language model to estimate likelihood in both directions: generating text from a video and generating video features from text. Combining the two scores gives a more reliable match between queries and candidates. It also applies Candidate Prior Normalization, a training-free adjustment that discounts each candidate's prior likelihood to reduce popularity bias.
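The scoring idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function and parameter names are hypothetical, and it assumes you already have per-candidate log-likelihoods in both directions plus each candidate's unconditional (prior) log-likelihood.

```python
import numpy as np

def retrieval_scores(ll_text_given_video, ll_video_given_text,
                     ll_candidate_prior, alpha=0.5):
    """Combine the two likelihood directions and subtract each
    candidate's unconditional log-likelihood (the prior), so that
    merely "popular" candidates are not ranked above relevant ones.
    All inputs are log-likelihood arrays over candidates for one query;
    alpha weights the two directions (illustrative choice)."""
    ll_t2v = np.asarray(ll_text_given_video, dtype=float)
    ll_v2t = np.asarray(ll_video_given_text, dtype=float)
    prior = np.asarray(ll_candidate_prior, dtype=float)
    bidirectional = alpha * ll_t2v + (1.0 - alpha) * ll_v2t
    # Training-free debiasing: discount the candidate prior.
    return bidirectional - prior

# Toy example: candidate 0 is "popular" (high prior) but less relevant.
scores = retrieval_scores(
    ll_text_given_video=[-1.0, -2.0],
    ll_video_given_text=[-1.5, -2.0],
    ll_candidate_prior=[-0.2, -3.0],
)
best = int(np.argmax(scores))
```

In this toy case the raw bidirectional scores alone would rank the popular candidate 0 first, but after subtracting the prior the more query-specific candidate 1 wins, which is the kind of bias correction the summary describes.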

Why it matters?

This matters because it makes searching large video and text collections more accurate by focusing on the true relevance between a query and each candidate, helping users find correct results more quickly.

Abstract

A novel retrieval framework using bidirectional likelihood estimation with multi-modal large language models and candidate prior normalization improves text-video retrieval by reducing candidate prior bias and enhancing query-candidate relevance.