MOSS-Audio-Tokenizer: Scaling Audio Tokenizers for Future Audio Foundation Models
Yitian Gong, Kuangwei Chen, Zhaoye Fei, Xiaogui Yang, Ke Chen, Yang Wang, Kexin Huang, Mingshu Chen, Ruixiao Li, Qingyuan Cheng, Shimin Li, Xipeng Qiu
2026-02-13
Summary
This paper introduces a new way to convert audio into discrete tokens that large language models can process much like text. It focuses on building a system that learns this conversion entirely from scratch, without relying on pre-existing audio understanding tools.
What's the problem?
Current methods for turning audio into a usable format for AI often rely on pretrained components or complex, mixed architectures that limit how faithfully the audio can be reconstructed and how well the system improves with more data. These designs bake in assumptions about audio that may not always hold, hindering their ability to handle diverse sounds and to scale up effectively.
What's the solution?
The researchers developed a system called CAT (Causal Audio Tokenizer with Transformer), built entirely from causal Transformer blocks, the same technology powering many modern language models. Its encoder, quantizer, and decoder are trained jointly from scratch, so the system learns to encode, compress, and decode audio in a single end-to-end pipeline. They then scaled this design into MOSS-Audio-Tokenizer, a 1.6-billion-parameter tokenizer pre-trained on 3 million hours of diverse audio data. This end-to-end approach yields better reconstruction quality and improves predictably as the model grows.
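To make the design concrete, here is a minimal, hypothetical sketch of a CAT-style pipeline in PyTorch: a causal Transformer encoder, a single-codebook vector quantizer with a straight-through estimator, and a causal Transformer decoder trained end-to-end on a reconstruction loss. The frame size, dimensions, and quantizer details are illustrative assumptions, not the paper's actual configuration (the paper may use a different quantization scheme, such as residual quantization).

```python
# Minimal, hypothetical sketch of a CAT-style tokenizer. All sizes and the
# single-codebook quantizer are illustrative assumptions, not the paper's setup.
import torch
import torch.nn as nn


def causal_mask(n: int) -> torch.Tensor:
    # Upper-triangular additive mask so each frame attends only to past frames.
    return torch.triu(torch.full((n, n), float("-inf")), diagonal=1)


class CausalTransformer(nn.Module):
    def __init__(self, dim: int = 256, depth: int = 4, heads: int = 4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim,
            batch_first=True, norm_first=True,
        )
        self.blocks = nn.TransformerEncoder(layer, num_layers=depth)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.blocks(x, mask=causal_mask(x.size(1)).to(x.device))


class VectorQuantizer(nn.Module):
    def __init__(self, codebook_size: int = 1024, dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(codebook_size, dim)

    def forward(self, z: torch.Tensor):
        # Nearest-codeword lookup per frame, then a straight-through estimator
        # so gradients flow into the encoder despite the hard discrete choice.
        dist = torch.cdist(z, self.codebook.weight)   # (B, T, K)
        codes = dist.argmin(dim=-1)                   # (B, T) discrete tokens
        z_q = self.codebook(codes)
        z_q = z + (z_q - z).detach()
        return z_q, codes


class ToyCausalAudioTokenizer(nn.Module):
    def __init__(self, frame: int = 320, dim: int = 256):
        super().__init__()
        self.embed = nn.Linear(frame, dim)     # waveform frames -> latent frames
        self.encoder = CausalTransformer(dim)
        self.quantizer = VectorQuantizer(dim=dim)
        self.decoder = CausalTransformer(dim)
        self.unembed = nn.Linear(dim, frame)   # latent frames -> waveform frames

    def forward(self, wav: torch.Tensor):
        # wav: (B, T * frame) raw samples, chunked into non-overlapping frames.
        frames = wav.unflatten(-1, (-1, self.embed.in_features))
        z_q, codes = self.quantizer(self.encoder(self.embed(frames)))
        recon = self.unembed(self.decoder(z_q)).flatten(-2)
        return recon, codes


model = ToyCausalAudioTokenizer()
wav = torch.randn(2, 16 * 320)                 # 2 clips, 16 frames each
recon, codes = model(wav)
loss = nn.functional.mse_loss(recon, wav)      # one term of an end-to-end loss
print(codes.shape, loss.item())                # torch.Size([2, 16]) ...
```

The property this sketch tries to preserve is homogeneity: every trainable stage is the same kind of causal Transformer block, so scaling the tokenizer amounts to making the stack deeper and wider rather than redesigning heterogeneous components.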
Why it matters?
This work is important because it provides a simpler and more effective way to integrate audio directly into large language models. The new tokenizer outperforms prior codecs at compressing speech, sound, and music, supports competitive speech recognition without auxiliary encoders, and enables a purely autoregressive text-to-speech model that surpasses prior non-autoregressive and cascaded systems. Together, this opens the door to AI systems that natively understand and generate audio.
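As a rough illustration of what "native" integration buys, the sketch below folds the tokenizer's discrete codes into a text LM's vocabulary and trains with ordinary next-token prediction. The vocabulary sizes, offset scheme, and toy stand-in model are assumptions made for illustration; the paper's actual autoregressive TTS system is far larger and not specified at this level of detail here.

```python
# Hypothetical sketch: once audio is a token sequence, one autoregressive LM
# can model text and audio in a single shared vocabulary. All IDs and sizes
# below are illustrative assumptions.
import torch
import torch.nn as nn

TEXT_VOCAB, AUDIO_VOCAB = 32_000, 1_024
VOCAB = TEXT_VOCAB + AUDIO_VOCAB            # audio codes mapped to [32000, 33024)

lm = nn.Sequential(                          # stand-in for a causal Transformer LM
    nn.Embedding(VOCAB, 128), nn.Linear(128, VOCAB)
)

text_ids = torch.randint(0, TEXT_VOCAB, (1, 12))     # text prompt tokens
audio_ids = torch.randint(0, AUDIO_VOCAB, (1, 40))   # tokenizer output codes
sequence = torch.cat([text_ids, audio_ids + TEXT_VOCAB], dim=1)

# Standard next-token prediction over the mixed sequence: the same loss used
# for text LMs now teaches the model to continue a text prompt with audio.
logits = lm(sequence[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, VOCAB), sequence[:, 1:].reshape(-1)
)
print(loss.item())
# At inference, sampled IDs >= TEXT_VOCAB would be shifted back into
# [0, AUDIO_VOCAB) and fed to the tokenizer's decoder to synthesize audio.
```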
Abstract
Discrete audio tokenizers are fundamental to empowering large language models with native audio processing and generation capabilities. Despite recent progress, existing approaches often rely on pretrained encoders, semantic distillation, or heterogeneous CNN-based architectures. These designs introduce fixed inductive biases that limit reconstruction fidelity and hinder effective scaling. In this paper, we argue that discrete audio tokenization should be learned fully end-to-end using a homogeneous and scalable architecture. To this end, we first propose CAT (Causal Audio Tokenizer with Transformer), a purely Transformer-based architecture that jointly optimizes the encoder, quantizer, and decoder from scratch for high-fidelity reconstruction. Building on the CAT architecture, we develop MOSS-Audio-Tokenizer, a large-scale audio tokenizer featuring 1.6 billion parameters, pre-trained on 3 million hours of diverse, general audio data. We show that this simple, fully end-to-end approach built from homogeneous, causal Transformer blocks scales gracefully and supports high-fidelity reconstruction across diverse audio domains. Across speech, sound, and music, MOSS-Audio-Tokenizer consistently outperforms prior codecs over a wide range of bitrates, while exhibiting predictable improvements with increased scale. Notably, leveraging the discrete tokens from our model, we develop the first purely autoregressive TTS model that surpasses prior non-autoregressive and cascaded systems. Furthermore, MOSS-Audio-Tokenizer enables competitive ASR performance without auxiliary encoders. Our findings position the CAT architecture as a unified, scalable interface for the next generation of native audio foundation models.