Accelerating High-Fidelity Waveform Generation via Adversarial Flow Matching Optimization

Sang-Hoon Lee, Ha-Yeong Choi, Seong-Whan Lee

2024-08-16

Summary

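This paper introduces PeriodWave-Turbo, a waveform generation model that produces high-fidelity audio in only a few generation steps. It is obtained by fine-tuning a pre-trained flow matching model with reconstruction losses and adversarial feedback.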

What's the problem?

Generating high-fidelity sound waves can be slow. Models based on conditional flow matching (CFM) produce good results but need many ODE steps to generate audio, which makes them slower than GANs (Generative Adversarial Networks), which need only a single generation step. In addition, their outputs often lack high-frequency detail because the estimated vector field is noisy, which lowers the quality of the audio.

What's the solution?

To solve these issues, the authors take a pre-trained CFM model and turn it into a generator that runs a fixed number of steps. They then fine-tune it with reconstruction losses and adversarial feedback to improve the quality of the generated sound waves. After only 1,000 fine-tuning steps, PeriodWave-Turbo needs 2 or 4 generation steps instead of 16 while reaching state-of-the-art results on a range of objective metrics.

Why does it matter?

This research is significant because it makes it easier and faster to create high-quality audio for applications like music production, speech synthesis, and other audio technologies. By improving the efficiency of waveform generation, it can lead to better tools for creators and developers in the audio industry.

Abstract

This paper introduces PeriodWave-Turbo, a high-fidelity and highly efficient waveform generation model via adversarial flow matching optimization. Recently, conditional flow matching (CFM) generative models have been successfully adopted for waveform generation tasks, leveraging a single vector field estimation objective for training. Although these models can generate high-fidelity waveform signals, they require significantly more ODE steps than GAN-based models, which only need a single generation step. Additionally, the generated samples often lack high-frequency information due to noisy vector field estimation, which fails to ensure high-frequency reproduction. To address these limitations, we enhance pre-trained CFM-based generative models by incorporating a fixed-step generator modification. We utilize reconstruction losses and adversarial feedback to accelerate high-fidelity waveform generation. Through adversarial flow matching optimization, it only requires 1,000 steps of fine-tuning to achieve state-of-the-art performance across various objective metrics. Moreover, we significantly reduce the number of inference steps from 16 to 2 or 4. Additionally, by scaling up the backbone of PeriodWave from 29M to 70M parameters for improved generalization, PeriodWave-Turbo achieves unprecedented performance, with a perceptual evaluation of speech quality (PESQ) score of 4.454 on the LibriTTS dataset. Audio samples, source code and checkpoints will be available at https://github.com/sh-lee-prml/PeriodWave.
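
To make the notion of "ODE steps" in the abstract concrete, the following is a minimal sketch of fixed-step Euler sampling, assuming a generic vector field rather than the actual PeriodWave network. The names `euler_generate` and `make_toy_field` are hypothetical and do not come from the paper or its repository, and the toy field is a hand-written stand-in for a trained estimator. The sketch only shows that the number of vector field evaluations equals the number of steps, which is why going from 16 steps to 2 or 4 reduces inference cost.

```python
from typing import Callable, List

Vector = List[float]
VectorField = Callable[[Vector, float], Vector]


def euler_generate(x0: Vector, field: VectorField, num_steps: int) -> Vector:
    """Integrate dx/dt = field(x, t) from t=0 to t=1 with a fixed number of Euler steps."""
    if num_steps < 1:
        raise ValueError("num_steps must be at least 1")
    dt = 1.0 / num_steps
    x = list(x0)
    for k in range(num_steps):
        t = k * dt
        v = field(x, t)  # in a real model this would be one network forward pass
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def make_toy_field(target: Vector, counter: List[int]) -> VectorField:
    """Hypothetical stand-in for a trained vector field estimator (assumption).

    It points straight from the current sample toward `target`;
    `counter[0]` records how many times the field was evaluated.
    """
    def field(x: Vector, t: float) -> Vector:
        counter[0] += 1
        return [(ti - xi) / (1.0 - t) for ti, xi in zip(target, x)]
    return field


if __name__ == "__main__":
    target = [0.5, -0.25, 1.0]   # pretend "waveform" samples
    noise = [0.1, 0.9, -0.3]     # pretend starting noise
    for steps in (16, 4, 2):
        calls = [0]
        out = euler_generate(noise, make_toy_field(target, calls), steps)
        print(steps, calls[0], [round(v, 6) for v in out])
```

In this toy setting the field is perfectly straight, so even 2 steps land on the target. A learned vector field is noisy and curved, so cutting the step count usually costs quality unless the model is adapted to a fixed step budget, which is the gap the paper's fine-tuning is meant to close.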
