ISDrama: Immersive Spatial Drama Generation through Multimodal Prompting
Yu Zhang, Wenxiang Guo, Changhao Pan, Zhiyuan Zhu, Tao Jin, Zhou Zhao
2025-04-30
Summary
This paper introduces ISDrama, an AI system that generates realistic, dramatic audio performances in which characters sound as if they are moving and speaking in a 3D space.
What's the problem?
Most computer-generated voices sound flat: they miss the emotion and expressiveness of a real performance, and they cannot convey that voices are coming from different positions around the listener.
What's the solution?
The researchers built ISDrama to accept several kinds of input, such as text and sound. It uses contrastive learning and a flow-based transformer to generate binaural speech with dramatic prosody, so voices sound more emotional and appear to come from positions all around the listener.
Why does it matter?
This matters because it could make virtual reality, video games, and audiobooks more immersive and entertaining by making their audio more lifelike and dramatic.
Abstract
A multimodal spatial drama generation model, ISDrama, leverages contrastive learning and a flow-based transformer to produce high-quality binaural speech with dramatic prosody from multimodal inputs.
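To make "binaural" concrete, the sketch below places a mono signal at a fixed horizontal angle using two classic cues: a small time delay and a level difference between the ears. It is a textbook illustration only, not ISDrama's method, which learns to generate binaural speech with a model. The function names, the head radius, the Woodworth delay approximation, and the constant-power panning are all assumptions chosen for this example.

```python
import math

SPEED_OF_SOUND = 343.0   # m/s in room-temperature air
HEAD_RADIUS = 0.0875     # m; a commonly used average, assumed for this sketch


def interaural_time_difference(azimuth_deg: float) -> float:
    """Woodworth approximation of the interaural time difference in seconds.

    azimuth_deg: 0 is straight ahead, +90 is fully to the listener's right.
    """
    theta = math.radians(azimuth_deg)
    return (HEAD_RADIUS / SPEED_OF_SOUND) * (theta + math.sin(theta))


def render_binaural(mono, azimuth_deg, sample_rate=16000):
    """Return (left, right) sample lists for a mono signal at one azimuth."""
    if not -90.0 <= azimuth_deg <= 90.0:
        raise ValueError("azimuth_deg must be in [-90, 90]")
    # Time cue: delay the ear farther from the source by whole samples.
    delay = round(abs(interaural_time_difference(azimuth_deg)) * sample_rate)
    # Level cue: constant-power panning, a crude stand-in for head shadowing.
    pan = math.radians((azimuth_deg + 90.0) / 2.0)  # ranges over 0..pi/2
    gain_left, gain_right = math.cos(pan), math.sin(pan)
    pad = [0.0] * delay
    left = [gain_left * s for s in mono]
    right = [gain_right * s for s in mono]
    if azimuth_deg > 0:
        # Source on the right: the left ear hears it later.
        left, right = pad + left, right + pad
    else:
        # Source on the left, or centered (delay is 0 when centered).
        left, right = left + pad, pad + right
    return left, right
```

Calling `render_binaural(samples, 45.0)` returns two equal-length channels that can be written out as a stereo file. A real system also has to model distance, room acoustics, and moving speakers, which this sketch ignores.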