PodAgent: A Comprehensive Framework for Podcast Generation
Yujia Xiao, Lei He, Haohan Guo, Fenglong Xie, Tan Lee
2025-03-04
Summary
This paper introduces PodAgent, an AI system that automatically creates podcast-like audio programs. Multiple AI 'agents' work together to generate the content, match voices to roles, and produce expressive speech.
What's the problem?
Current AI methods for creating audio programs struggle to generate high-quality, podcast-like content. They have trouble producing in-depth discussions and matching appropriate voices to the content. There's also no standard way to evaluate how good these AI-generated podcasts are.
What's the solution?
The researchers created PodAgent, which uses three main techniques to solve these problems. First, it has a system of AI 'agents' (Host, Guest, and Writer) that work together to create informative discussions. Second, it builds a pool of voices to match with different roles in the podcast. Third, it uses advanced AI language models to make the synthetic speech sound more natural and expressive. They also developed new ways to evaluate how well the system works.
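To make the first two techniques more concrete, here is a minimal sketch of how a Host-Guest-Writer loop and a voice-pool match could fit together. It is not the authors' implementation: the `Voice` and `Role` fields, the prompts, the attribute-counting score, and the `fake_llm` placeholder are all assumptions made for illustration. A real system would call an actual language model and use a richer voice description.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

# Hypothetical stand-in for a language model call: prompt in, text out.
LLM = Callable[[str], str]


@dataclass
class Voice:
    voice_id: str
    gender: str
    style: str  # assumed attribute, e.g. "warm" or "authoritative"


@dataclass
class Role:
    name: str
    gender: str
    style: str


def draft_script(topic: str, guests: List[Role], llm: LLM) -> List[Tuple[str, str]]:
    """Host opens, each guest responds, then a writer pass polishes every turn."""
    opening = llm(f"As the host, open a podcast discussion on: {topic}")
    turns = [("Host", opening)]
    for guest in guests:
        reply = llm(f"As {guest.name}, respond to the host: {opening}")
        turns.append((guest.name, reply))
    # Writer agent: rewrite each turn so the script reads as one conversation.
    return [(speaker, llm(f"As the writer, polish this line: {text}"))
            for speaker, text in turns]


def match_voices(roles: List[Role], pool: List[Voice]) -> Dict[str, str]:
    """Greedy match: give each role the unused voice sharing the most attributes."""
    available = list(pool)
    assignment: Dict[str, str] = {}
    for role in roles:
        best = max(
            available,
            key=lambda v: (v.gender == role.gender) + (v.style == role.style),
        )
        assignment[role.name] = best.voice_id
        available.remove(best)  # each voice is used at most once
    return assignment


def fake_llm(prompt: str) -> str:
    # Placeholder only: a real system would query a language model here.
    return f"[reply to: {prompt[:40]}]"


roles = [Role("Host", "female", "warm"), Role("Guest", "male", "authoritative")]
pool = [
    Voice("v1", "male", "authoritative"),
    Voice("v2", "female", "warm"),
    Voice("v3", "female", "calm"),
]

script = draft_script("urban beekeeping", roles[1:], fake_llm)
assignment = match_voices(roles, pool)
```

With this toy pool, the Host is assigned `v2` and the Guest `v1`, because those voices share both assumed attributes with their roles. The greedy match is chosen only for brevity; it assumes the pool has at least as many voices as there are roles.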
Why it matters?
This matters because it could change how podcasts are made. PodAgent could make it easier and faster to create high-quality audio content on any topic, which could lead to more diverse and accessible podcasts. However, it also raises questions about the role of human creators in podcasting and the authenticity of AI-generated content.
Abstract
Existing automatic audio generation methods struggle to generate podcast-like audio programs effectively. The key challenges lie in in-depth content generation and in appropriate, expressive voice production. This paper proposes PodAgent, a comprehensive framework for creating audio programs. PodAgent 1) generates informative topic-discussion content by designing a Host-Guest-Writer multi-agent collaboration system, 2) builds a voice pool for suitable voice-role matching, and 3) uses an LLM-enhanced speech synthesis method to generate expressive conversational speech. Given the absence of standardized evaluation criteria for podcast-like audio generation, we developed comprehensive assessment guidelines to effectively evaluate the model's performance. Experimental results demonstrate PodAgent's effectiveness, significantly surpassing direct GPT-4 generation in topic-discussion dialogue content, achieving an 87.4% voice-matching accuracy, and producing more expressive speech through LLM-guided synthesis. Demo page: https://podcast-agent.github.io/demo/. Source code: https://github.com/yujxx/PodAgent.