Bidirectional Likelihood Estimation with Multi-Modal Large Language Models for Text-Video Retrieval

Dohwan Ko, Ji Soo Lee, Minhyuk Choi, Zihang Meng, Hyunwoo J. Kim

2025-08-06

Summary

This paper presents a new approach to text-video retrieval called bidirectional likelihood estimation with multi-modal large language models, which scores both how likely a video is given a text query and how likely the text is given the video.

What's the problem?

The problem is that current retrieval methods often favor popular or frequently occurring candidates over the ones most relevant to the query, which introduces bias and degrades search results.

What's the solution?

The paper introduces a framework that trains a multi-modal large language model to estimate likelihood in both directions: generating text from a video and generating video features from text. Combining the two scores gives a more reliable match between queries and candidates. It also applies Candidate Prior Normalization, a training-free adjustment that discounts each candidate's prior likelihood to reduce popularity bias.
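The scoring idea can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function and parameter names are hypothetical, and it assumes you already have per-candidate log-likelihoods in both directions plus each candidate's unconditional (prior) log-likelihood.

```python
import numpy as np

def retrieval_scores(ll_text_given_video, ll_video_given_text,
                     ll_candidate_prior, alpha=0.5):
    """Combine the two likelihood directions and subtract each
    candidate's unconditional log-likelihood (the prior), so that
    merely "popular" candidates are not ranked above relevant ones.
    All inputs are log-likelihood arrays over candidates for one query;
    alpha weights the two directions (illustrative choice)."""
    ll_t2v = np.asarray(ll_text_given_video, dtype=float)
    ll_v2t = np.asarray(ll_video_given_text, dtype=float)
    prior = np.asarray(ll_candidate_prior, dtype=float)
    bidirectional = alpha * ll_t2v + (1.0 - alpha) * ll_v2t
    # Training-free debiasing: discount the candidate prior.
    return bidirectional - prior

# Toy example: candidate 0 is "popular" (high prior) but less relevant.
scores = retrieval_scores(
    ll_text_given_video=[-1.0, -2.0],
    ll_video_given_text=[-1.5, -2.0],
    ll_candidate_prior=[-0.2, -3.0],
)
best = int(np.argmax(scores))
```

In this toy case the raw bidirectional scores alone would rank the popular candidate 0 first, but after subtracting the prior the more query-specific candidate 1 wins, which is the kind of bias correction the summary describes.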

Why it matters?

This matters because it makes searching large video and text collections more accurate by focusing on the true relevance between a query and each candidate, helping users find correct results more quickly.

Abstract

A novel retrieval framework using bidirectional likelihood estimation with multi-modal large language models and candidate prior normalization improves text-video retrieval by reducing candidate prior bias and enhancing query-candidate relevance.