
Efficient Speech Language Modeling via Energy Distance in Continuous Latent Space

Zhengrui Ma, Yang Feng, Chenze Shao, Fandong Meng, Jie Zhou, Min Zhang

2025-05-20


Summary

This paper introduces SLED, a method that helps computers understand and generate speech more efficiently. It turns sound waves into a continuous code and uses a new way of measuring how close the model's output is to real speech.

What's the problem?

The problem is that making computers talk or understand speech usually takes a lot of computing power and can be slow or inaccurate, especially when dealing with real, continuous sound rather than simple text.

What's the solution?

To solve this, the researchers built a system that first turns speech into a continuous code, called a latent representation. The model then learns to predict and generate these codes using a measurement called energy distance, which helps it train faster and more accurately.
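To give a feel for the idea, here is a minimal NumPy sketch of an empirical energy-distance loss. This is an illustration of the general technique, not the paper's exact objective: the function name and the setup (two independent model samples `x1`, `x2` compared against a target latent `y`) are assumptions for the example.

```python
import numpy as np

def energy_distance_loss(x1, x2, y):
    """Empirical energy-distance loss between model samples and a target.

    x1, x2: two independent samples drawn from the model for the same step
    y:      the target continuous latent vector
    The loss gets smaller as the model's samples match the target
    distribution; unlike plain regression, the repulsion term lets the
    model stay stochastic instead of collapsing to a single mean output.
    """
    attract = np.linalg.norm(x1 - y) + np.linalg.norm(x2 - y)  # pull samples toward target
    repel = np.linalg.norm(x1 - x2)                            # reward sample diversity
    return attract - repel

# Toy usage: samples near the target latent yield a much smaller loss
rng = np.random.default_rng(0)
y = rng.standard_normal(16)                 # pretend target latent
x1 = y + 0.01 * rng.standard_normal(16)     # two close model samples
x2 = y + 0.01 * rng.standard_normal(16)
far = rng.standard_normal(16)               # an unrelated sample
print(energy_distance_loss(x1, x2, y) < energy_distance_loss(far, far, y))
```

Because the loss only needs samples and distances (no explicit probability density), it suits continuous latents where writing down a likelihood is hard.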

Why does it matter?

This matters because it means speech-based AI, like voice assistants or translation apps, can work better and faster, making them more helpful and accessible for everyone.

Abstract

SLED encodes speech waveforms into continuous latent representations and uses an energy distance objective to model them autoregressively for efficient and accurate speech synthesis.