Paper2Video: Automatic Video Generation from Scientific Papers
Zeyu Zhu, Kevin Qinghong Lin, Mike Zheng Shou
2025-10-07

Summary
This paper introduces Paper2Video, a new benchmark, and PaperTalker, an AI system designed to automatically create presentation videos from research papers, with the goal of making it much easier and faster to share research findings.
What's the problem?
Creating a good presentation video that explains a research paper is currently a lot of work: even a short video can take hours of slide design, audio recording, and editing. Existing video generation tools aren't well suited to this task because research presentations have unique needs: they draw on dense, multi-modal information from the paper, including text, figures, and tables, and they require several channels, such as slides, subtitles, speech, and the presenter, to stay synchronized.
What's the solution?
The researchers built a benchmark called Paper2Video containing 101 research papers paired with the presentation videos, slides, and speaker metadata created by the original authors. They also developed a new AI system, PaperTalker, which uses multiple AI 'agents' working together to generate these videos automatically. The system handles everything from creating the slides and refining their layout, to adding subtitles, generating speech, and even rendering a realistic talking head that presents the work, and it speeds up the process by generating each slide's segment independently and in parallel.
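To make the "each slide independently" idea concrete, here is a minimal sketch of how such a slide-wise pipeline could be parallelized. This is not the authors' code: make_subtitles, synthesize_speech, and render_clip are hypothetical stubs that stand in for the real subtitle, text-to-speech, and talking-head stages described above.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass

# Placeholder stages (hypothetical): a real system would call an LLM for
# subtitles, a TTS model for speech, and a talking-head renderer for video.
def make_subtitles(slide_text: str) -> str:
    return f"Narration for: {slide_text}"

def synthesize_speech(subtitles: str) -> bytes:
    return subtitles.encode()  # stand-in for an audio waveform

def render_clip(index: int, audio: bytes) -> str:
    return f"clip_{index:03d}.mp4"  # stand-in for a rendered video file

@dataclass
class SlideClip:
    index: int
    path: str

def build_slide_clip(index: int, slide_text: str) -> SlideClip:
    # Per-slide pipeline: subtitles -> speech -> rendered talking-head clip.
    subtitles = make_subtitles(slide_text)
    audio = synthesize_speech(subtitles)
    return SlideClip(index, render_clip(index, audio))

def generate_presentation(slides: list[str]) -> list[SlideClip]:
    # Slides do not depend on one another, so each can be processed in parallel.
    with ThreadPoolExecutor() as pool:
        clips = list(pool.map(build_slide_clip, range(len(slides)), slides))
    # Restore slide order before concatenating clips into the final video.
    return sorted(clips, key=lambda c: c.index)

if __name__ == "__main__":
    print(generate_presentation(["Intro", "Method", "Results"]))
```

The key design point is that per-slide work is embarrassingly parallel; only the final concatenation needs the slides back in order.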
Why it matters?
This work is important because it takes a significant step towards automating the creation of academic presentation videos. It could greatly reduce the time and effort researchers spend on video production, letting them focus more on the research itself while making their work accessible to a wider audience. The new benchmark and evaluation metrics also provide a standard way to measure the quality of automatically generated presentation videos.
Abstract
Academic presentation videos have become an essential medium for research communication, yet producing them remains highly labor-intensive, often requiring hours of slide design, recording, and editing for a short 2 to 10 minute video. Unlike natural video, presentation video generation involves distinctive challenges: inputs from research papers, dense multi-modal information (text, figures, tables), and the need to coordinate multiple aligned channels such as slides, subtitles, speech, and a human talker. To address these challenges, we introduce Paper2Video, the first benchmark of 101 research papers paired with author-created presentation videos, slides, and speaker metadata. We further design four tailored evaluation metrics--Meta Similarity, PresentArena, PresentQuiz, and IP Memory--to measure how well videos convey the paper's information to the audience. Building on this foundation, we propose PaperTalker, the first multi-agent framework for academic presentation video generation. It integrates slide generation with layout refinement driven by a novel tree-search visual choice, cursor grounding, subtitling, speech synthesis, and talking-head rendering, while parallelizing slide-wise generation for efficiency. Experiments on Paper2Video demonstrate that the presentation videos produced by our approach are more faithful and informative than those of existing baselines, establishing a practical step toward automated and ready-to-use academic video generation. Our dataset, agent, and code are available at https://github.com/showlab/Paper2Video.
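The "tree search visual choice" used for layout refinement is only named in the abstract, but the general pattern of searching over rendered layout candidates can be illustrated as follows. This is a hedged sketch under assumed details, not the paper's implementation: Layout, propose_variants, and visual_score are hypothetical, and the scoring stub stands in for a visual judge such as a vision-language model comparing rendered slides.

```python
import random
from dataclasses import dataclass

@dataclass
class Layout:
    font_size: int       # body text size in points
    figure_scale: float  # relative figure size on the slide

def propose_variants(layout: Layout, n: int = 3) -> list[Layout]:
    # Hypothetical local edits branching out from the current layout.
    return [
        Layout(
            layout.font_size + random.choice([-2, 0, 2]),
            round(layout.figure_scale * random.choice([0.9, 1.0, 1.1]), 2),
        )
        for _ in range(n)
    ]

def visual_score(layout: Layout) -> float:
    # Stub judge: prefers readable font sizes and figures that fit the slide.
    # A real system would render the slide and ask a visual model to rate it.
    return -abs(layout.font_size - 20) - 10 * abs(layout.figure_scale - 1.0)

def tree_search_layout(initial: Layout, depth: int = 2, beam: int = 2) -> Layout:
    frontier = [initial]
    best = initial
    for _ in range(depth):
        # Expand every layout in the frontier into candidate variants.
        candidates = [v for node in frontier for v in propose_variants(node)]
        candidates.sort(key=visual_score, reverse=True)
        frontier = candidates[:beam]  # keep only the most promising branches
        if visual_score(frontier[0]) > visual_score(best):
            best = frontier[0]
    return best

if __name__ == "__main__":
    print(tree_search_layout(Layout(font_size=28, figure_scale=0.6)))
```

The point of the sketch is the search structure: propose several visual candidates per slide, score them with a judge, and expand only the best branches rather than committing to a single generated layout.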