
FuseCodec: Semantic-Contextual Fusion and Supervision for Neural Codecs

Md Mubtasim Ahasan, Rafat Hasan Khan, Tasnim Mohiuddin, Aman Chadha, Tariq Iqbal, M Ashraful Amin, Amin Ahsan Ali, Md Mofijul Islam, A K M Mahbubur Rahman

2025-09-16

Summary

This paper introduces a new method called FuseCodec for converting speech into a series of discrete tokens, similar to how text is broken down into words. It aims to create a better system than existing methods by considering not just the raw sound of speech, but also its meaning and the context in which it's spoken.

What's the problem?

Current speech tokenization methods focus heavily on the basic acoustic features of speech, essentially just capturing the 'sound' of words. They often miss important clues about what the speech *means* or how it relates to the surrounding conversation. While some recent approaches try to incorporate meaning and context, they struggle to effectively combine these different types of information into a unified representation, leading to less accurate and natural-sounding results.

What's the solution?

FuseCodec tackles this problem by blending acoustic, semantic (meaning-based), and contextual (surrounding-information) representations of speech. It does this in three main ways: first, it integrates semantic and contextual information directly into the core encoding process. Second, it uses a 'global supervision' technique, pooling those representations over time, to ensure the tokens capture the overall meaning and stay consistent across the utterance. Finally, it uses 'temporally aligned supervision' to match each speech token to the most relevant contextual token within a small local window, improving accuracy at the frame level. The authors also introduce FuseCodec-TTS, a variant that applies the method to zero-shot text-to-speech, generating speech in a target voice from text alone.
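To make the three mechanisms concrete, here is a minimal NumPy sketch of what each one could look like. This is an illustration of the general ideas, not the paper's implementation: the function names, projection matrices, and the use of simple addition for fusion, mean-pooling for the global target, and cosine similarity for windowed matching are all assumptions for the sake of the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def latent_fusion(acoustic, semantic, contextual, w_s, w_c):
    """Sketch of latent representation fusion: project semantic and
    contextual features into the encoder's latent dimension and add
    them to the acoustic latents. (Projection-then-add is an assumed
    fusion rule, chosen here for simplicity.)

    acoustic:   (T, D)  encoder latents
    semantic:   (T, Ds) e.g. self-supervised speech features
    contextual: (T, Dc) e.g. pre-trained language-model features
    w_s, w_c:   hypothetical projection matrices, (Ds, D) and (Dc, D)
    """
    return acoustic + semantic @ w_s + contextual @ w_c  # (T, D)

def global_supervision_target(features):
    """Sketch of global supervision: mean-pool over time, then
    broadcast the pooled vector back to every frame so each discrete
    token can be supervised against an utterance-level summary."""
    pooled = features.mean(axis=0, keepdims=True)        # (1, D)
    return np.broadcast_to(pooled, features.shape)       # (T, D)

def temporally_aligned_match(speech, contextual, window=2):
    """Sketch of temporally aligned supervision: for each speech frame
    t, pick the most cosine-similar contextual frame inside the local
    window [t - window, t + window]; returns the matched indices."""
    def unit(x):
        return x / (np.linalg.norm(x, axis=-1, keepdims=True) + 1e-8)
    s, c = unit(speech), unit(contextual)
    matches = []
    for t in range(len(s)):
        lo, hi = max(0, t - window), min(len(c), t + window + 1)
        sims = c[lo:hi] @ s[t]                 # cosine similarities
        matches.append(lo + int(np.argmax(sims)))
    return np.array(matches)

# Toy shapes: 8 frames, 16-dim latents, 24/32-dim side features.
T, D = 8, 16
acoustic   = rng.normal(size=(T, D))
semantic   = rng.normal(size=(T, 24))
contextual = rng.normal(size=(T, 32))
fused  = latent_fusion(acoustic, semantic, contextual,
                       rng.normal(size=(24, D)), rng.normal(size=(32, D)))
target = global_supervision_target(fused)
idx    = temporally_aligned_match(acoustic, acoustic, window=2)
```

In a real codec, the fused latents would be quantized into discrete tokens, and the global and aligned targets would drive auxiliary distillation losses during training; those pieces are omitted here.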

Why it matters?

This research is important because it significantly improves the accuracy and quality of speech tokenization. Better tokenization leads to better performance in tasks like speech recognition, speech synthesis, and understanding spoken language. FuseCodec sets a new standard for these tasks, outperforming previous methods in several key areas, and opens the door for more advanced and natural-sounding speech technologies.

Abstract

Speech tokenization enables discrete representation and facilitates speech language modeling. However, existing neural codecs capture low-level acoustic features, overlooking the semantic and contextual cues inherent to human speech. While recent efforts introduced semantic representations from self-supervised speech models or incorporated contextual representations from pre-trained language models, challenges remain in aligning and unifying the semantic and contextual representations. We introduce FuseCodec, which unifies acoustic, semantic, and contextual representations through strong cross-modal alignment and globally informed supervision. We propose three complementary techniques: (i) Latent Representation Fusion, integrating semantic and contextual features directly into the encoder latent space for robust and unified representation learning; (ii) Global Semantic-Contextual Supervision, supervising discrete tokens with globally pooled and broadcasted representations to enhance temporal consistency and cross-modal alignment; and (iii) Temporally Aligned Contextual Supervision, strengthening alignment by dynamically matching contextual and speech tokens within a local window for fine-grained token-level supervision. We further introduce FuseCodec-TTS, demonstrating our methodology's applicability to zero-shot speech synthesis. Empirically, FuseCodec achieves state-of-the-art performance on LibriSpeech, surpassing EnCodec, SpeechTokenizer, and DAC in transcription accuracy, perceptual quality, intelligibility, and speaker similarity. Results highlight the effectiveness of contextually and semantically guided tokenization for speech tokenization and downstream tasks. Code and pretrained models are available at https://github.com/mubtasimahasan/FuseCodec.