Describe What You See with Multimodal Large Language Models to Enhance Video Recommendations
Marco De Nadai, Andreas Damianou, Mounia Lalmas
2025-08-20
Summary
This paper introduces a new way to recommend videos by using multimodal large language models to understand the deeper meaning of video content, making recommendations more personalized and accurate.
What's the problem?
Current video recommendation systems often miss the subtle meanings or context within videos, focusing only on basic metadata or low-level signals such as what is visually on screen or what sounds occur. This means they can't grasp things like humor, intent, or cultural references, which are important for viewers to connect with content. For example, they might not distinguish between a singer on a rooftop and a parody filmed in a famous tourist spot, missing key information for good recommendations.
What's the solution?
The researchers developed a framework that prompts an off-the-shelf multimodal large language model (MLLM), which can interpret video, audio, and text, to summarize each video clip into a rich, natural-language description. This description captures higher-level ideas, like 'a superhero parody with slapstick fights and orchestral stabs,' without needing to retrain the model or change the recommendation system itself. These detailed descriptions are then encoded as text and fed into existing recommendation methods to improve their performance.
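As a rough sketch of the idea, the two-step pipeline (describe, then embed) might look like the snippet below. The `describe_clip` helper, its canned outputs, and the toy hashed bag-of-words encoder are all illustrative assumptions, not the paper's actual prompt or text encoder:

```python
import hashlib
import math

def describe_clip(clip_id: str) -> str:
    """Stand-in for an off-the-shelf MLLM call (hypothetical helper).
    In practice this would send video frames and audio to a multimodal
    model with a prompt asking for a rich description covering tone,
    humour, and cultural references."""
    canned = {
        "clip_007": "a superhero parody with slapstick fights and orchestral stabs",
        "clip_012": "an ironic music-video parody filmed amid fairy chimneys",
    }
    return canned.get(clip_id, "a short-form video clip")

def embed_text(text: str, dim: int = 64) -> list[float]:
    """Toy L2-normalised hashed bag-of-words embedding; a real system
    would use a state-of-the-art text encoder here instead."""
    vec = [0.0] * dim
    for token in text.lower().split():
        h = int(hashlib.md5(token.encode()).hexdigest(), 16)
        vec[h % dim] += 1.0
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

# The downstream recommender never sees raw pixels or audio,
# only this text-derived vector.
description = describe_clip("clip_007")
vector = embed_text(description)
```

The key design point the paper emphasises is that the MLLM and the recommender stay frozen: only the clip's representation changes, from low-level audiovisual features to an embedding of a semantic description.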
Why it matters?
This approach is important because it makes video recommendations much smarter by understanding the actual content and its potential appeal to users, not just its surface-level features. By using these language models, recommendation systems can better understand user intent and preferences, leading to a more engaging and satisfying experience for viewers, especially on platforms with a vast amount of short-form videos.
Abstract
Existing video recommender systems rely primarily on user-defined metadata or on low-level visual and acoustic signals extracted by specialised encoders. These low-level features describe what appears on the screen but miss deeper semantics such as intent, humour, and world knowledge that make clips resonate with viewers. For example, is a 30-second clip simply a singer on a rooftop, or an ironic parody filmed amid the fairy chimneys of Cappadocia, Turkey? Such distinctions are critical to personalised recommendations yet remain invisible to traditional encoding pipelines. In this paper, we introduce a simple, recommendation system-agnostic zero-finetuning framework that injects high-level semantics into the recommendation pipeline by prompting an off-the-shelf Multimodal Large Language Model (MLLM) to summarise each clip into a rich natural-language description (e.g. "a superhero parody with slapstick fights and orchestral stabs"), bridging the gap between raw content and user intent. We use MLLM output with a state-of-the-art text encoder and feed it into standard collaborative, content-based, and generative recommenders. On the MicroLens-100K dataset, which emulates user interactions with TikTok-style videos, our framework consistently surpasses conventional video, audio, and metadata features in five representative models. Our findings highlight the promise of leveraging MLLMs as on-the-fly knowledge extractors to build more intent-aware video recommenders.
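To make the "feed it into standard recommenders" step concrete, here is a minimal content-based ranker over such description embeddings. The mean-profile scoring, the cosine similarity, and the toy 3-d vectors are a simple illustrative stand-in, not any of the five models evaluated in the paper:

```python
import math

def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two embedding vectors."""
    num = sum(x * y for x, y in zip(a, b))
    den = (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))) or 1.0
    return num / den

def rank_candidates(user_history_vecs: list[list[float]],
                    candidates: dict[str, list[float]]) -> list[tuple[str, list[float]]]:
    """Score each candidate clip by cosine similarity between its
    description embedding and the mean embedding of the clips the
    user has already watched, then sort best-first."""
    dim = len(user_history_vecs[0])
    profile = [sum(v[i] for v in user_history_vecs) / len(user_history_vecs)
               for i in range(dim)]
    return sorted(candidates.items(),
                  key=lambda kv: cosine(profile, kv[1]),
                  reverse=True)

# Toy 3-d embeddings standing in for text-encoder output.
history = [[1.0, 0.0, 0.2], [0.9, 0.1, 0.0]]  # user watched parody-style clips
candidates = {
    "parody_clip": [1.0, 0.0, 0.1],
    "travel_vlog": [0.0, 1.0, 0.0],
}
ranking = rank_candidates(history, candidates)
```

Because the embeddings already encode high-level semantics such as "parody", even this naive ranker surfaces the thematically matching clip first; the paper's contribution is showing that the same representation swap helps collaborative, content-based, and generative recommenders alike.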