
HiFi-SR: A Unified Generative Transformer-Convolutional Adversarial Network for High-Fidelity Speech Super-Resolution

Shengkui Zhao, Kun Zhou, Zexu Pan, Yukun Ma, Chong Zhang, Bin Ma

2025-01-20


Summary

This paper introduces HiFi-SR, a new AI system that takes low-quality audio recordings and makes them sound much clearer and more detailed. It's like having a super-smart audio enhancer that can make a voice recording sound as if it was captured with a high-end microphone.

What's the problem?

Current methods for improving audio quality often use separate AI models for different parts of the process, which can lead to inconsistencies and poor results, especially when dealing with types of audio the AI hasn't been trained on. It's like trying to restore an old painting by having different artists work on different parts without coordinating – the final result might not look right.

What's the solution?

The researchers created HiFi-SR, which combines two types of AI networks (a transformer and a convolutional network) into one unified system trained end to end. The transformer acts as an encoder that turns the low-resolution audio's mel-spectrogram into a latent representation, and the convolutional network upscales that representation directly into a high-resolution waveform. The system also uses a clever technique called adversarial training, where one part of the AI tries to create realistic-sounding audio while another part tries to spot fake-sounding audio, pushing the whole system to improve over time. HiFi-SR can take any speech recording with a sampling rate between 4 kHz and 32 kHz and upgrade it to studio-quality 48 kHz audio.
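One practical consequence of the fixed 48 kHz output is that the upsampling factor changes with the input: an 8 kHz phone recording needs six times more samples, while a 32 kHz recording needs only 1.5 times more. The arithmetic below is an illustrative sketch, not code from the paper; the function name and error handling are assumptions.

```python
# Illustrative arithmetic (not from the paper's code): HiFi-SR maps any
# input sampling rate from 4 kHz to 32 kHz to a fixed 48 kHz output,
# so the time-domain upsampling factor varies with the input rate.
TARGET_SR = 48_000

def upsampling_factor(input_sr: int) -> float:
    """Ratio of output samples to input samples for a given input rate."""
    if not 4_000 <= input_sr <= 32_000:
        raise ValueError("HiFi-SR supports inputs from 4 kHz to 32 kHz")
    return TARGET_SR / input_sr

assert upsampling_factor(8_000) == 6.0    # low-rate phone audio: 6x more samples
assert upsampling_factor(16_000) == 3.0   # common speech rate: 3x more samples
assert upsampling_factor(32_000) == 1.5   # near-full-band input: 1.5x more samples
```

This is why a single model covering the whole 4–32 kHz range is harder than a fixed-ratio upsampler: the generator must handle widely different amounts of missing high-frequency content.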

Why it matters?

This matters because it could dramatically improve the quality of all kinds of voice recordings. Imagine being able to make old, scratchy recordings of historical speeches sound crystal clear, or upgrading low-quality phone calls to high-definition audio. It could be used in many fields, from restoring archived audio to improving voice assistants and making video calls sound better. Plus, it works well even on types of audio it wasn't specifically trained on, which means it could be used in many different situations without needing to be retrained each time.

Abstract

The application of generative adversarial networks (GANs) has recently advanced speech super-resolution (SR) based on intermediate representations like mel-spectrograms. However, existing SR methods that typically rely on independently trained and concatenated networks may lead to inconsistent representations and poor speech quality, especially in out-of-domain scenarios. In this work, we propose HiFi-SR, a unified network that leverages end-to-end adversarial training to achieve high-fidelity speech super-resolution. Our model features a unified transformer-convolutional generator designed to seamlessly handle both the prediction of latent representations and their conversion into time-domain waveforms. The transformer network serves as a powerful encoder, converting low-resolution mel-spectrograms into latent space representations, while the convolutional network upscales these representations into high-resolution waveforms. To enhance high-frequency fidelity, we incorporate a multi-band, multi-scale time-frequency discriminator, along with a multi-scale mel-reconstruction loss in the adversarial training process. HiFi-SR is versatile, capable of upscaling any input speech signal between 4 kHz and 32 kHz to a 48 kHz sampling rate. Experimental results demonstrate that HiFi-SR significantly outperforms existing speech SR methods across both objective metrics and ABX preference tests, for both in-domain and out-of-domain scenarios (https://github.com/modelscope/ClearerVoice-Studio).
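The multi-scale mel-reconstruction loss mentioned in the abstract compares generated and reference audio at several time-frequency resolutions, so errors are penalized both in fine temporal detail and in broad spectral shape. The sketch below illustrates the general idea only; it is not the authors' implementation, and for simplicity it compares linear STFT magnitudes rather than mel-spectrograms, with assumed window sizes and hop lengths.

```python
# Illustrative sketch (not the authors' code): a multi-scale spectral
# reconstruction loss in the spirit of HiFi-SR's multi-scale
# mel-reconstruction loss. Linear STFT magnitudes stand in for
# mel-spectrograms; the FFT sizes and hops are assumed values.
import numpy as np

def stft_magnitude(x, n_fft, hop):
    """Magnitude spectrogram via a framed real FFT with a Hann window."""
    window = np.hanning(n_fft)
    frames = [x[i:i + n_fft] * window
              for i in range(0, len(x) - n_fft + 1, hop)]
    return np.abs(np.fft.rfft(np.stack(frames), axis=-1))

def multi_scale_spectral_loss(pred, target, scales=(256, 512, 1024)):
    """Average L1 distance between magnitude spectrograms computed at
    several FFT sizes, penalizing errors at multiple resolutions."""
    loss = 0.0
    for n_fft in scales:
        hop = n_fft // 4
        loss += np.mean(np.abs(stft_magnitude(pred, n_fft, hop)
                               - stft_magnitude(target, n_fft, hop)))
    return loss / len(scales)

rng = np.random.default_rng(0)
clean = rng.standard_normal(48_000)  # one second of 48 kHz audio
assert multi_scale_spectral_loss(clean, clean) == 0.0
assert multi_scale_spectral_loss(clean, np.zeros_like(clean)) > 0.0
```

In adversarial training, a reconstruction term like this is typically combined with the discriminator losses so the generator matches the reference spectrum while the discriminator pushes it toward realistic high-frequency detail.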