PresentAgent: Multimodal Agent for Presentation Video Generation

Jingwei Shi, Zeyu Zhang, Biao Wu, Yanjie Liang, Meng Fang, Ling Chen, Yang Zhao

2025-07-08

Summary

This paper introduces PresentAgent, a multimodal AI agent that turns long documents into high-quality narrated presentation videos. Separate modules handle document understanding, speech generation, and video creation.

What's the problem?

Producing a narrated presentation video from a long document takes considerable manual time and effort, and existing AI tools neither fully automate the process nor produce high-quality results.

What's the solution?

The researchers built PresentAgent as a modular pipeline that breaks the task into smaller steps: content understanding, script writing, narration, and video generation. They also built PresentEval, an evaluation framework that uses vision-language models to assess the quality of the resulting presentation videos.
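As a rough illustration of how such a modular pipeline might be wired together, here is a minimal Python sketch. It is not the authors' implementation: the summary does not describe the internals, so every name below (Slide, understand, write_script, narrate, build_timeline) is hypothetical, the language-model and text-to-speech steps are replaced by simple placeholders, and the speaking rate of 150 words per minute is an assumption.

```python
from dataclasses import dataclass
from typing import List, Tuple

WORDS_PER_MINUTE = 150  # assumed speaking rate, used only to estimate durations


@dataclass
class Slide:
    title: str
    bullets: List[str]
    narration: str = ""


def understand(document: str) -> List[Slide]:
    """Content understanding (placeholder): one slide per blank-line-separated block."""
    slides = []
    for block in document.strip().split("\n\n"):
        lines = [ln.strip() for ln in block.splitlines() if ln.strip()]
        if lines:
            slides.append(Slide(title=lines[0], bullets=lines[1:]))
    return slides


def write_script(slide: Slide) -> str:
    """Script writing (placeholder for a language-model call)."""
    if not slide.bullets:
        return f"{slide.title}."
    return f"{slide.title}. " + " ".join(slide.bullets)


def narrate(script: str) -> float:
    """Narration (placeholder for text-to-speech): estimated duration in seconds."""
    return len(script.split()) * 60.0 / WORDS_PER_MINUTE


def build_timeline(document: str) -> List[Tuple[float, float, Slide]]:
    """Video generation (placeholder): lay slides end to end on a timeline."""
    timeline = []
    start = 0.0
    for slide in understand(document):
        slide.narration = write_script(slide)
        duration = narrate(slide.narration)
        timeline.append((start, start + duration, slide))
        start += duration
    return timeline


if __name__ == "__main__":
    doc = "Intro\nPoint a\nPoint b\n\nMethod\nStep one"
    for begin, end, slide in build_timeline(doc):
        print(f"{begin:5.1f}s to {end:5.1f}s  {slide.title}")
```

Keeping each stage behind its own function means a placeholder can later be swapped for a real model or speech engine without touching the other stages, which is the main appeal of a modular design.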

Why does it matter?

This matters because PresentAgent can help people quickly create professional and engaging presentations from documents, saving time and making knowledge sharing easier and more accessible.

Abstract

PresentAgent generates high-quality narrated presentation videos from long-form documents using a modular pipeline and is evaluated using PresentEval, a Vision-Language Model-based framework.
