Fast Text-to-Audio Generation with Adversarial Post-Training
Zachary Novack, Zach Evans, Zack Zukowski, Josiah Taylor, CJ Carr, Julian Parker, Adnan Al-Sinan, Gian Marco Iodice, Julian McAuley, Taylor Berg-Kirkpatrick, Jordi Pons
2025-05-14
Summary
This paper talks about a new way to make AI models turn written text into audio, like music or sound effects, much faster and with less delay than before.
What's the problem?
The problem is that most current systems that generate audio from text are either too slow or take too long to produce high-quality sounds, which can be frustrating for users who want instant results.
What's the solution?
The researchers used a special training method called Adversarial Relativistic-Contrastive (ARC) post-training to improve how quickly and efficiently these AI models can generate audio. This method helps the models learn to create realistic sounds much faster, reducing the wait time for users.
Why it matters?
This matters because it makes it possible to use AI for real-time applications like instant sound design, interactive games, or live virtual assistants, where speed and quality are both important.
Abstract
Adversarial Relativistic-Contrastive (ARC) post-training optimizes diffusion/flow models for fast text-to-audio generation with minimal latency.