Summarization of Multimodal Presentations with Vision-Language Models: Study of the Effect of Modalities and Structure
Théo Gigant, Camille Guinaudeau, Frédéric Dufaux
2025-04-16
Summary
This paper studies how AI models that understand both images and text (vision-language models) can automatically summarize presentations that combine slides with spoken words.
What's the problem?
Presentations mix visuals, such as slides, with spoken explanations, which makes them hard for conventional AI systems to summarize: most handle only text or only images, so important information can be lost or misconnected when summarizing the whole presentation.
What's the solution?
The researchers compared different ways of feeding slides and transcripts to vision-language models. They found that a structured input format that interleaves the slides with the corresponding spoken words, making their connection and order explicit, produces noticeably better summaries of the presentation's main points.
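As a rough illustration of what such a structured, interleaved input might look like, here is a minimal sketch in Python. The tags, the one-transcript-segment-per-slide alignment, and the function name are assumptions for illustration, not the paper's exact format:

```python
# Hypothetical sketch: build a structured, interleaved representation of a
# presentation (slides + transcript) to pass to a vision-language model.
# The "[Slide N: ...]" placeholders stand in for where slide images would be
# inserted in an actual multimodal prompt.

def build_interleaved_prompt(slides, transcript_segments):
    """Interleave slide references with the speech spoken over each slide.

    slides: list of slide identifiers (e.g. image file paths)
    transcript_segments: list of transcript strings, one per slide (assumed alignment)
    """
    parts = []
    for i, (slide, speech) in enumerate(zip(slides, transcript_segments), start=1):
        parts.append(f"[Slide {i}: {slide}]")        # placeholder for the slide image
        parts.append(f"Speaker: {speech.strip()}")   # transcript aligned to this slide
    parts.append("Summarize the presentation above.")
    return "\n".join(parts)
```

The point of this structure is that the model sees each slide next to the words spoken over it, rather than receiving all slides and the full transcript as two disconnected blocks.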
Why does it matter?
This matters because it can help students, professionals, and anyone who needs to quickly understand long or complicated presentations. Better summaries save time and make it easier to learn or review important information from multimodal content.
Abstract
Analyses of VLMs for automatic summarization of multimodal presentations demonstrate that structured representations of interleaved slides and transcripts yield the best performance.