JEPA as a Neural Tokenizer: Learning Robust Speech Representations with Density Adaptive Attention
Georgios Ioannides, Christos Constantinou, Aman Chadha, Aaron Elkins, Linsey Pang, Ravid Shwartz-Ziv, Yann LeCun
2025-12-09
Summary
This research presents a new way to create high-quality, highly compressed representations of speech, a bit like a super-efficient audio file format, but learned by artificial intelligence.
What's the problem?
Existing methods for compressing speech often lose important details, require a lot of computing power, or fail to capture the underlying structure of speech. They struggle to balance reconstruction quality, compression rate, and efficiency, especially when the compressed output needs to work well with language models.
What's the solution?
The researchers built a two-stage system. First, they used JEPA (Joint-Embedding Predictive Architecture) together with DAAM (a Density Adaptive Attention Mechanism) to learn the important *meaning* of sounds, without trying to perfectly recreate the original audio. This focuses on what a sound *is* rather than its exact waveform. Then, they compressed those meaningful representations even further using a method called FSQ (Finite Scalar Quantization), and reconstructed the audio with a HiFi-GAN decoder. A key ingredient is DAAM, which helps the model focus on the most important parts of the speech at each moment in time and capture its overall structure.
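To make the "density-adaptive" idea concrete, here is a minimal NumPy sketch of Gaussian-mixture-based gating: each feature is scored by its density under a small Gaussian mixture and scaled by that score. This is an illustration of the general mechanism only, not the authors' exact formulation; in the real model the mixture parameters would be learned, and `means`/`stds` below are hypothetical.

```python
import numpy as np

def daam_gate(x, means, stds):
    """Density-adaptive gating sketch.

    x      : (T, D) array of frame features
    means  : list of K mixture-component means (learned in practice)
    stds   : list of K mixture-component standard deviations
    """
    g = np.zeros_like(x)
    for m, s in zip(means, stds):
        # unnormalized Gaussian density of each feature under component (m, s)
        g += np.exp(-0.5 * ((x - m) / s) ** 2)
    g /= len(means)  # average over mixture components
    # gate: features in high-density regions are kept, outliers are attenuated
    return x * g
```

Since each component's unnormalized density is at most 1, the gate only attenuates; features far from every component mean are pushed toward zero, which is one simple way to realize adaptive temporal feature selection.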
Why it matters?
This work is important because it creates a new way to represent speech that is both highly compressed and retains a lot of information. This is useful for things like voice assistants, speech recognition, and creating more efficient audio files. The resulting compressed speech is also easier for language models to work with, potentially improving their performance.
Abstract
We introduce a two-stage self-supervised framework that combines the Joint-Embedding Predictive Architecture (JEPA) with a Density Adaptive Attention Mechanism (DAAM) for learning robust speech representations. Stage 1 uses JEPA with DAAM to learn semantic audio features via masked prediction in latent space, fully decoupled from waveform reconstruction. Stage 2 leverages these representations for efficient tokenization using Finite Scalar Quantization (FSQ) and a mixed-radix packing scheme, followed by high-fidelity waveform reconstruction with a HiFi-GAN decoder. By integrating Gaussian mixture-based density-adaptive gating into the JEPA encoder, the model performs adaptive temporal feature selection and discovers hierarchical speech structure at a low frame rate of 2.5 Hz. The resulting tokens (47.5 tokens/sec) provide a reversible, highly compressed, and language-model-friendly representation that is competitive with, and often more efficient than, existing neural audio codecs.
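The tokenization step in Stage 2 can be sketched as FSQ (bound each latent dimension, then round it to a small fixed number of levels) followed by mixed-radix packing of the per-dimension codes into a single integer token. The level counts used below are illustrative placeholders, not the paper's configuration:

```python
import numpy as np

def fsq_quantize(z, levels):
    """Finite Scalar Quantization: squash each dim to [-1, 1] with tanh,
    then round to one of `levels[d]` evenly spaced points, giving an
    integer code in {0, ..., levels[d] - 1} per dimension."""
    half = (np.asarray(levels) - 1) / 2.0
    return np.round(np.tanh(z) * half + half).astype(int)

def pack_mixed_radix(codes, levels):
    """Fold per-dimension codes into one integer, treating `levels`
    as the radix of each digit (least-significant dimension first)."""
    token, base = 0, 1
    for c, L in zip(codes, levels):
        token += int(c) * base
        base *= L
    return token

def unpack_mixed_radix(token, levels):
    """Inverse of pack_mixed_radix: recover the per-dimension codes."""
    codes = []
    for L in levels:
        codes.append(token % L)
        token //= L
    return codes
```

Because packing and unpacking are exact inverses, the token stream is reversible: the decoder can recover every per-dimension code and hand the dequantized latents to the HiFi-GAN stage.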