Discrete Audio Tokens: More Than a Survey!

Pooneh Mousavi, Gallil Maimon, Adel Moumen, Darius Petermann, Jiatong Shi, Haibin Wu, Haici Yang, Anastasia Kuznetsova, Artem Ploujnikov, Ricard Marxer, Bhuvana Ramabhadran, Benjamin Elizalde, Loren Lugosch, Jinyu Li, Cem Subakan, Phil Woodland, Minje Kim, Hung-yi Lee, Shinji Watanabe, Yossi Adi, Mirco Ravanelli

2025-06-15

Summary

This paper surveys discrete audio tokens: compact digital codes that represent sounds such as speech, music, and other audio. These tokens let computers handle audio more efficiently and make it easier to combine audio processing with modern language models.

What's the problem?

Audio is usually continuous and complex, which makes it hard for computers to store, analyze, and generate it quickly and effectively. Existing surveys often focus on only some kinds of audio or certain tasks, and they do not give a clear comparison between different methods.

What's the solution?

The paper systematically reviews and compares many methods for turning audio into discrete tokens across speech, music, and general audio. It proposes a taxonomy of these methods, benchmarks their quality on different tasks, and analyzes their strengths and limitations to guide future research.
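
To make the idea of turning audio into discrete tokens concrete, here is a minimal sketch of nearest-neighbor vector quantization, one common building block of audio tokenizers. It is an illustration only, not a method taken from the paper: the function names `quantize` and `dequantize`, the hand-picked codebook, and the toy two-dimensional "frames" are all assumptions. Real tokenizers typically learn their codebooks and apply them to encoder features rather than to raw values like these.

```python
from typing import List, Sequence

Vector = Sequence[float]


def quantize(frames: Sequence[Vector], codebook: Sequence[Vector]) -> List[int]:
    """Map each frame to the index of its nearest codebook vector."""
    tokens = []
    for frame in frames:
        # Squared Euclidean distance from this frame to every codebook entry.
        distances = [
            sum((a - b) ** 2 for a, b in zip(frame, code)) for code in codebook
        ]
        tokens.append(distances.index(min(distances)))
    return tokens


def dequantize(tokens: Sequence[int], codebook: Sequence[Vector]) -> List[List[float]]:
    """Replace each token by its codebook vector (a lossy reconstruction)."""
    return [list(codebook[t]) for t in tokens]


# Hypothetical codebook with three entries; real codebooks are learned.
codebook = [(0.0, 0.0), (1.0, 1.0), (-1.0, 1.0)]

# Toy "audio frames": each is a short feature vector.
frames = [(0.1, -0.2), (0.9, 1.2), (-0.8, 0.7), (0.4, 0.3)]

tokens = quantize(frames, codebook)
print(tokens)  # [0, 1, 2, 0]
print(dequantize(tokens, codebook))
```

Each frame is replaced by a single integer, which is what makes the representation compact and easy to feed to a language model. The reconstruction is only approximate, so there is a trade-off between compression and quality.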

Why does it matter?

Better ways to convert and process audio can improve technologies such as speech recognition, music generation, and audio-enhanced language models, making sound-based AI systems more powerful and efficient.

Abstract

A systematic review and benchmark of discrete audio tokenizers across speech, music, and general audio domains is presented, covering their taxonomy, evaluation metrics, and limitations.
