Scaling Transformers for Low-Bitrate High-Quality Speech Coding

Julian D Parker, Anton Smirnov, Jordi Pons, CJ Carr, Zack Zukowski, Zach Evans, Xubo Liu

2024-12-02

Summary

This paper presents a method for improving speech coding with large transformer models, achieving high-quality speech at very low bit rates.

What's the problem?

Current methods for encoding speech into digital formats often struggle to maintain high quality while keeping the data size small. This matters most for applications that must transmit audio efficiently, such as streaming or voice communication, where bandwidth is limited. Traditional codec models have concentrated on small parameter counts and haven't fully explored whether larger models can improve quality.

What's the solution?

The authors scale up a transformer architecture and apply a technique called Finite Scalar Quantization (FSQ) to compress speech data effectively. By pairing a large parameter count with this flexible quantization bottleneck, they achieve high-quality speech coding at extremely low bit rates of just 400 or 700 bits per second. Their models outperform existing methods in both objective tests (measuring performance with specific metrics) and subjective tests (how humans perceive the quality). The sketch below illustrates the FSQ idea.
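To make FSQ concrete, here is a minimal sketch in Python of the core operation: each dimension of the latent vector is squashed into a bounded range and rounded to a small fixed grid of values, and the resulting grid point doubles as a discrete token. The level counts below are hypothetical illustrations, not the paper's configuration; real FSQ also handles even level counts with a half-step offset and uses a straight-through gradient during training.

```python
import numpy as np

def fsq_quantize(z, levels):
    """Minimal Finite Scalar Quantization sketch: squash each latent
    dimension into a bounded range, then round it to the nearest of a
    small number of levels. Odd level counts only; even counts need an
    extra half-step offset in the full FSQ formulation.

    z      : latent array of shape (..., d), with d == len(levels)
    levels : quantization levels per dimension (hypothetical values)
    """
    half = (np.asarray(levels) - 1) / 2.0
    bounded = np.tanh(z) * half   # each dim now lies in (-half, half)
    codes = np.round(bounded)     # nearest integer grid point
    # Training would use a straight-through estimator to pass gradients
    # through the non-differentiable round(); omitted in this sketch.
    return codes / half           # normalize back to [-1, 1] for decoding

# Hypothetical 5-dimensional grid: 5**5 = 3,125 possible tokens,
# i.e. about 11.6 bits per latent frame.
z = np.random.randn(2, 5)
print(fsq_quantize(z, levels=[5, 5, 5, 5, 5]))
```

Because the grid is fixed rather than learned, FSQ avoids the codebook-collapse problems of learned vector quantization, which is part of what makes it an attractive bottleneck when scaling up the model.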

Why it matters?

This research is significant because it demonstrates that larger transformer models can be used effectively for speech coding, allowing for better quality audio transmission without requiring more bandwidth. This improvement can benefit various fields such as telecommunications, streaming services, and AI applications that rely on clear audio communication.

Abstract

The tokenization of speech with neural audio codec models is a vital part of modern AI pipelines for the generation or understanding of speech, alone or in a multimodal context. Traditionally such tokenization models have concentrated on low parameter-count architectures using only components with strong inductive biases. In this work we show that by scaling a transformer architecture with large parameter count to this problem, and applying a flexible Finite Scalar Quantization (FSQ) based bottleneck, it is possible to reach state-of-the-art speech quality at extremely low bit-rates of 400 or 700 bits-per-second. The trained models strongly outperform existing baselines in both objective and subjective tests.
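As a back-of-the-envelope check on those headline rates, an FSQ codec's bit rate is simply the latent frame rate multiplied by the bits carried per frame (the log2 of the number of grid points). The frame rates and grid sizes below are hypothetical decompositions chosen to land on 400 and 700 bits per second; they are not the configurations reported in the paper.

```python
import math

def fsq_bitrate(frame_rate_hz, levels):
    """Bits per second = frames per second * bits per frame, where one
    FSQ frame carries log2(prod(levels)) bits."""
    bits_per_frame = sum(math.log2(n) for n in levels)
    return frame_rate_hz * bits_per_frame

# Hypothetical configurations that land on the paper's headline rates:
print(fsq_bitrate(25, [4] * 8))   # 25 fps * 16 bits/frame = 400 bps
print(fsq_bitrate(25, [4] * 14))  # 25 fps * 28 bits/frame = 700 bps
```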