FocalCodec: Low-Bitrate Speech Coding via Focal Modulation Networks

Luca Della Libera, Francesco Paissan, Cem Subakan, Mirco Ravanelli

2025-02-12


Summary

This paper introduces FocalCodec, a speech codec that compresses speech into far less digital data than existing codecs while still maintaining good quality. It's like a super-efficient way to shrink audio files without losing important details of what's being said or how it sounds.

What's the problem?

Current methods for compressing speech either use too much data, lose important information about the meaning or the sound of the speech, or rely on several codebooks at once, which makes them harder to build on in other systems. It's like trying to fit a long conversation into a short text message without losing any of the important parts or the tone of voice.

What's the solution?

The researchers created FocalCodec, which uses a technique called focal modulation to compress speech into a very small amount of data (between 0.16 and 0.65 kilobits per second). Instead of several codebooks, FocalCodec uses a single binary codebook to capture both the meaning and the sound of speech, making it easier to use in other applications. It also works well across different languages and in noisy environments.
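The quoted bitrate range follows from simple arithmetic: a codec that emits one token at a time from one codebook has a bitrate equal to tokens per second times bits per token. The sketch below illustrates this. The 13 bits per token and the token rates of 12.5, 25 and 50 per second are assumptions that are not stated in this summary; they are used here only because they reproduce the quoted range. The function name `bitrate_kbps` is hypothetical.

```python
def bitrate_kbps(tokens_per_second: float, bits_per_token: int) -> float:
    """Bitrate of a single-codebook codec, in kilobits per second."""
    return tokens_per_second * bits_per_token / 1000.0


# Assumed values (not given in this summary): 13 bits per token,
# emitted at 12.5, 25 or 50 tokens per second.
BITS_PER_TOKEN = 13

for rate in (12.5, 25.0, 50.0):
    print(f"{rate:>5} tokens/s -> {bitrate_kbps(rate, BITS_PER_TOKEN):.4f} kbps")
```

Under these assumptions the three settings come to 0.1625, 0.325 and 0.65 kbps, which matches the stated range of 0.16 to 0.65 kbps after rounding.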

Why does it matter?

This matters because it could lead to better ways of storing and transmitting speech in digital form. It could help improve things like voice assistants, language translation apps, or any technology that needs to work with human speech. By using less data while keeping the quality high, it could make these technologies work faster and more efficiently, especially on devices with limited storage or in areas with slow internet connections.

Abstract

Large language models have revolutionized natural language processing through self-supervised pretraining on massive datasets. Inspired by this success, researchers have explored adapting these methods to speech by discretizing continuous audio into tokens using neural audio codecs. However, existing approaches face limitations, including high bitrates, the loss of either semantic or acoustic information, and the reliance on multi-codebook designs when trying to capture both, which increases architectural complexity for downstream tasks. To address these challenges, we introduce FocalCodec, an efficient low-bitrate codec based on focal modulation that utilizes a single binary codebook to compress speech between 0.16 and 0.65 kbps. FocalCodec delivers competitive performance in speech resynthesis and voice conversion at lower bitrates than the current state-of-the-art, while effectively handling multilingual speech and noisy environments. Evaluation on downstream tasks shows that FocalCodec successfully preserves sufficient semantic and acoustic information, while also being well-suited for generative modeling. Demo samples, code and checkpoints are available at https://lucadellalib.github.io/focalcodec-web/.
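To make the idea of a single binary codebook concrete, here is a minimal sketch of one generic way to build one: quantize each latent dimension to its sign, then pack the resulting bits into a single integer token. This is an illustration under assumptions, not the authors' implementation. The abstract does not describe the quantizer in detail, and the helper names (`binarize`, `bits_to_token`, `token_to_code`) and the 4-dimensional latent are hypothetical.

```python
from typing import List, Sequence


def binarize(latent: Sequence[float]) -> List[int]:
    """Map each latent dimension to one bit: 1 if non-negative, else 0."""
    return [1 if x >= 0.0 else 0 for x in latent]


def bits_to_token(bits: Sequence[int]) -> int:
    """Pack bits (most significant first) into one integer token id."""
    token = 0
    for b in bits:
        token = (token << 1) | b
    return token


def token_to_code(token: int, num_bits: int) -> List[float]:
    """Recover the +1/-1 code vector that a token id stands for."""
    return [
        1.0 if (token >> (num_bits - 1 - i)) & 1 else -1.0
        for i in range(num_bits)
    ]


# Hypothetical 4-dimensional latent frame (illustration only).
latent = [0.7, -1.2, 0.1, -0.3]
token = bits_to_token(binarize(latent))
print(token, token_to_code(token, len(latent)))
```

With a scheme of this kind, a latent of n dimensions yields an implicit codebook of 2^n entries without storing any lookup table, and each frame is represented by exactly one token, which is what keeps downstream models simple compared with multi-codebook designs.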
