WavTokenizer: an Efficient Acoustic Discrete Codec Tokenizer for Audio Language Modeling

Shengpeng Ji, Ziyue Jiang, Xize Cheng, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Ruiqi Li, Ziang Zhang, Xiaoda Yang, Rongjie Huang, Yidi Jiang, Qian Chen, Siqi Zheng, Wen Wang, Zhou Zhao

2024-08-30

Summary

This paper introduces WavTokenizer, a new acoustic codec tokenizer that compresses audio into a small number of discrete tokens while preserving high sound quality, making the audio easier to use in language modeling.

What's the problem?

Audio signals are high-dimensional and difficult to process directly, and existing codec tokenizers struggle to compress them aggressively without losing quality, often requiring many tokens per second or multiple layers of quantizers. This makes audio expensive to use in applications like speech recognition or music analysis.

What's the solution?

WavTokenizer addresses this by compressing one second of audio (at a 24kHz sampling rate) into only 40 or 75 tokens using a single quantizer, far fewer than previous codecs require. It achieves this through a broader vector-quantization (VQ) space, extended contextual windows, and improved attention networks, which keep audio quality high even with fewer tokens. Extensive tests showed that WavTokenizer outperforms previous models in both sound quality and efficiency.
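The compression figures above can be checked with some back-of-the-envelope arithmetic: 24,000 samples per second reduced to 40 tokens means each token covers 600 raw samples. The sketch below computes this, plus an approximate bitrate; the 4,096-entry codebook size is an assumed illustrative value, not a figure from the paper.

```python
import math

def codec_stats(sample_rate_hz, tokens_per_second, codebook_size):
    """Back-of-the-envelope numbers for a single-quantizer codec tokenizer."""
    # How many raw audio samples each discrete token represents.
    downsample = sample_rate_hz / tokens_per_second
    # Each token indexes one codebook entry, so it carries log2(size) bits.
    bits_per_token = math.log2(codebook_size)
    bitrate_bps = tokens_per_second * bits_per_token
    return downsample, bitrate_bps

# The paper's 24 kHz / 40-tokens-per-second setting; the 4096-entry
# codebook is an assumption for illustration.
down, bps = codec_stats(24_000, 40, 4096)
print(down)  # 600.0 raw samples per token
print(bps)   # 480.0 bits per second
```

Under these assumptions, the 75-token setting gives 320 samples per token, trading some compression for reconstruction quality.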

Why it matters?

This advancement matters because it enables faster and cheaper processing of audio in applications such as virtual assistants, music production, and other AI systems. By shrinking the amount of data needed while keeping quality intact, WavTokenizer can improve user experiences and make audio practical in more technologies.

Abstract

Language models have been effectively applied to modeling natural signals, such as images, video, speech, and audio. A crucial component of these models is the codec tokenizer, which compresses high-dimensional natural signals into lower-dimensional discrete tokens. In this paper, we introduce WavTokenizer, which offers several advantages over previous SOTA acoustic codec models in the audio domain: 1) extreme compression. By compressing the layers of quantizers and the temporal dimension of the discrete codec, one-second audio of 24kHz sampling rate requires only a single quantizer with 40 or 75 tokens. 2) improved subjective quality. Despite the reduced number of tokens, WavTokenizer achieves state-of-the-art reconstruction quality with outstanding UTMOS scores and inherently contains richer semantic information. Specifically, we achieve these results by designing a broader VQ space, extended contextual windows, and improved attention networks, as well as introducing a powerful multi-scale discriminator and an inverse Fourier transform structure. We conducted extensive reconstruction experiments in the domains of speech, audio, and music. WavTokenizer exhibited strong performance across various objective and subjective metrics compared to state-of-the-art models. We also tested semantic information, VQ utilization, and adaptability to generative models. Comprehensive ablation studies confirm the necessity of each module in WavTokenizer. The related code, demos, and pre-trained models are available at https://github.com/jishengpeng/WavTokenizer.