AudioStory: Generating Long-Form Narrative Audio with Large Language Models
Yuxin Guo, Teng Wang, Yuying Ge, Shijie Ma, Yixiao Ge, Wei Zou, Ying Shan
2025-08-28
Summary
This paper introduces AudioStory, a new system for creating long, coherent audio narratives from text descriptions, like turning a story into a soundscape.
What's the problem?
Current text-to-audio technology is really good at making short sounds, but it struggles with longer pieces of audio, like a story with different scenes and a consistent emotional tone. Over a longer stretch, it is hard for these systems to maintain a sense of flow and keep everything sounding connected.
What's the solution?
The researchers built AudioStory, which combines large language models (the kind that can understand and generate human-like text) with text-to-audio systems. The language model breaks the story down into smaller, manageable parts, figuring out which sound should happen when and how to transition smoothly between scenes. A clever 'bridging' technique then makes sure each sound fits the current part of the story while staying connected to what came before. Importantly, the whole system is trained together rather than building each piece separately, which helps everything work better as a whole.
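To make the decomposition step concrete, here is a minimal, self-contained Python sketch of the idea: an LLM-style planner splits a narrative instruction into temporally ordered sub-events with contextual cues, and each sub-event is handed to a text-to-audio generator together with a short summary of what came before. Everything in it (the SubEvent fields, decompose_story, generate_audio, and the fixed example plan) is hypothetical and only illustrates the workflow described above, not the authors' implementation.

```python
# Illustrative sketch of the narrative-decomposition idea (not the authors' code).
# All class and function names here are hypothetical.
from dataclasses import dataclass


@dataclass
class SubEvent:
    """One temporally ordered piece of the narrative."""
    index: int
    caption: str        # what this part should sound like, e.g. "rain starts to fall"
    duration_s: float   # how long this event lasts
    context: str        # cue carried over from earlier events


def decompose_story(story: str) -> list[SubEvent]:
    """Stand-in for the LLM planner: split a story into ordered sub-events.

    In AudioStory this step is done by the LLM; here we fake it with a fixed
    plan so the example runs without any model.
    """
    return [
        SubEvent(0, "soft rain begins on a tin roof", 8.0, context=""),
        SubEvent(1, "distant thunder rolls in", 6.0, context="rain continues"),
        SubEvent(2, "the storm fades, birds return", 10.0, context="after the storm"),
    ]


def generate_audio(event: SubEvent, prev_summary: str) -> str:
    """Stand-in for the TTA generator conditioned on the event and prior context."""
    return f"<{event.duration_s:.0f}s clip: {event.caption} | carried context: {prev_summary or 'none'}>"


def render_narrative(story: str) -> list[str]:
    clips, running_context = [], ""
    for event in decompose_story(story):
        clips.append(generate_audio(event, running_context))
        # carry a short summary forward so the next clip stays coherent
        running_context = event.context or event.caption
    return clips


if __name__ == "__main__":
    for clip in render_narrative("A quiet afternoon turns into a thunderstorm and back."):
        print(clip)
```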
Why it matters?
This work is important because it moves us closer to being able to automatically generate high-quality, long-form audio content, like audiobooks, podcasts, or sound effects for games and movies, directly from text. They also created a new dataset to help other researchers improve these kinds of systems, and showed their system performs better than existing methods in both understanding instructions and creating realistic audio.
Abstract
Recent advances in text-to-audio (TTA) generation excel at synthesizing short audio clips but struggle with long-form narrative audio, which requires temporal coherence and compositional reasoning. To address this gap, we propose AudioStory, a unified framework that integrates large language models (LLMs) with TTA systems to generate structured, long-form audio narratives. AudioStory possesses strong instruction-following and reasoning-driven generation capabilities: it employs LLMs to decompose complex narrative queries into temporally ordered sub-tasks with contextual cues, enabling coherent scene transitions and emotional tone consistency. AudioStory has two appealing features: (1) Decoupled bridging mechanism: AudioStory disentangles LLM-diffuser collaboration into two specialized components, i.e., a bridging query for intra-event semantic alignment and a residual query for cross-event coherence preservation. (2) End-to-end training: By unifying instruction comprehension and audio generation within a single end-to-end framework, AudioStory eliminates the need for modular training pipelines while enhancing synergy between components. Furthermore, we establish a benchmark, AudioStory-10K, encompassing diverse domains such as animated soundscapes and natural sound narratives. Extensive experiments show the superiority of AudioStory in both single-audio generation and narrative audio generation, surpassing prior TTA baselines in both instruction-following ability and audio fidelity. Our code is available at https://github.com/TencentARC/AudioStory.
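As a rough illustration of the decoupled bridging mechanism described in the abstract, the sketch below defines a small PyTorch module in which learned 'bridging' queries attend to the LLM hidden states of the current sub-event (intra-event semantic alignment) while 'residual' queries carry information forward across events (cross-event coherence); their outputs are projected into a conditioning sequence that a diffusion-based TTA generator could consume. The class name, dimensions, query counts, and the additive way previous residual states are folded in are all assumptions for illustration, not details taken from the paper.

```python
# Toy illustration of the decoupled bridging idea: bridging queries align with
# the current event's semantics, residual queries preserve cross-event context.
# Shapes, module names, and the residual update rule are assumptions.
import torch
import torch.nn as nn


class DecoupledQueryBridge(nn.Module):
    def __init__(self, llm_dim=1024, cond_dim=768, n_bridge=32, n_residual=8):
        super().__init__()
        self.bridge_queries = nn.Parameter(torch.randn(n_bridge, llm_dim))
        self.residual_queries = nn.Parameter(torch.randn(n_residual, llm_dim))
        self.attn = nn.MultiheadAttention(llm_dim, num_heads=8, batch_first=True)
        self.to_cond = nn.Linear(llm_dim, cond_dim)

    def forward(self, llm_hidden, prev_residual=None):
        # llm_hidden: (B, T, llm_dim) hidden states for the current sub-event
        B = llm_hidden.size(0)
        bridge = self.bridge_queries.expand(B, -1, -1)
        residual = self.residual_queries.expand(B, -1, -1)
        if prev_residual is not None:
            # fold in context carried over from earlier events (illustrative choice)
            residual = residual + prev_residual
        queries = torch.cat([bridge, residual], dim=1)
        attended, _ = self.attn(queries, llm_hidden, llm_hidden)
        cond = self.to_cond(attended)                     # conditioning for the TTA diffuser
        new_residual = attended[:, -residual.size(1):]    # carried forward to the next event
        return cond, new_residual


if __name__ == "__main__":
    bridge = DecoupledQueryBridge()
    hidden = torch.randn(2, 40, 1024)                     # fake LLM hidden states
    cond, res = bridge(hidden)
    print(cond.shape, res.shape)                          # (2, 40, 768) and (2, 8, 1024)
```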