Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi DAI, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, Wei Xue

2025-02-07

Summary

This paper introduces Llasa, a text-to-speech (TTS) system that uses a single Llama-style Transformer to generate natural and expressive speech from text. By collapsing the usual multi-stage pipeline into one model, it is simpler to train and easier to scale.

What's the problem?

Most current TTS systems built on large language models are multi-stage, using separate models for different parts of the speech generation process (for example, a language model followed by a diffusion model). This makes it unclear which stage to scale during training or inference, and it increases the cost and complexity of scaling these systems. In addition, existing systems often struggle to produce speech that sounds truly natural and emotionally expressive.

What's the solution?

The researchers developed Llasa, which pairs a single-layer vector quantizer (VQ) speech codec with a single Transformer architecture that mirrors standard large language models like Llama. Scaling train-time compute (larger models and more training data) consistently improves the naturalness and prosody of the synthesized speech. Scaling inference-time compute, by sampling multiple candidate outputs and using speech understanding models as verifiers to pick the best one, improves emotional expressiveness, timbre (voice) consistency, and content accuracy. The system can also adapt to new voices without extra training.
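As a rough illustration of the inference-time scaling idea, the sketch below shows a generic verifier-guided best-of-N search: a TTS model proposes several candidate utterances and a speech-understanding verifier scores them. The function names and scoring heuristic here are placeholders for illustration, not the paper's actual models or implementation.

```python
# Hedged sketch of verifier-guided best-of-N sampling (inference-time scaling).
# Both stub functions below are placeholders, not the paper's models.
import random

def synthesize_candidate(text, seed):
    """Stub for one sampling run of the TTS model. In a Llasa-style system this
    would be autoregressive decoding of speech codec tokens with sampling on."""
    rng = random.Random(seed)
    return [rng.randrange(1024) for _ in range(50)]   # fake codec tokens

def verifier_score(text, candidate_tokens):
    """Stub verifier. In the paper, speech understanding models (e.g. ASR or
    speaker/emotion models) rate content accuracy, timbre consistency, or
    expressiveness; here we just use a dummy heuristic."""
    return -abs(sum(candidate_tokens) / len(candidate_tokens) - 512)

def best_of_n(text, n=8):
    """Spend more inference compute: sample n candidates, keep the best-scoring one."""
    candidates = [synthesize_candidate(text, seed=i) for i in range(n)]
    return max(candidates, key=lambda c: verifier_score(text, c))

best = best_of_n("I can't believe we finally made it!", n=16)
```

Raising n spends more compute at inference time and, as the paper argues, steers the sampled outputs toward whatever quality the chosen verifier measures.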

Why it matters?

This research is important because it simplifies the creation of realistic and expressive speech, making TTS technology more accessible and efficient. Llasa's ability to scale easily and produce high-quality results could lead to better virtual assistants, audiobooks, accessibility tools, and other applications where natural-sounding speech is essential.

Abstract

Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both training-time and inference-time compute. However, current state-of-the-art TTS systems leveraging LLMs are often multi-stage, requiring separate models (e.g., diffusion models after LLM), complicating the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose a simple framework Llasa for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as Llama. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we publicly release the checkpoints and training code for our TTS models (1B, 3B, 8B) and codec model.
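To make the "single Transformer, fully aligned with standard LLMs" idea concrete, here is a minimal sketch of how TTS can be cast as next-token prediction over one sequence containing text tokens followed by speech codec tokens. All of the token ID ranges, stub tokenizer, stub predictor, and stopping rule below are illustrative assumptions, not the released model or codec.

```python
# Minimal sketch of "TTS as causal language modeling" with stub components.
# Placeholder assumptions: vocabulary sizes, special tokens, and the fake
# predictor; the real system uses a Llama-style Transformer plus a
# single-layer VQ codec that turns speech tokens back into a waveform.

TEXT_VOCAB = 32_000            # text token IDs occupy 0 .. 31_999
BOS_SPEECH = TEXT_VOCAB        # special token: "speech tokens start here"
EOS_SPEECH = TEXT_VOCAB + 1    # special token: "speech tokens end here"
SPEECH_BASE = TEXT_VOCAB + 2   # single-codebook speech token IDs start here

def encode_text(text):
    """Stub tokenizer: map words to fake text-token IDs."""
    return [hash(word) % TEXT_VOCAB for word in text.split()]

def predict_next_token(sequence):
    """Stub for one forward pass of the Transformer. The real model predicts
    the next speech codec token given all text and speech tokens so far."""
    if len(sequence) > 60:                        # pretend the model decides to stop
        return EOS_SPEECH
    return SPEECH_BASE + (sum(sequence) % 1024)   # fake codec token

def synthesize(text):
    """Decode TTS as one causal sequence: [text tokens, BOS, speech tokens, EOS]."""
    sequence = encode_text(text) + [BOS_SPEECH]
    while True:
        token = predict_next_token(sequence)
        sequence.append(token)
        if token == EOS_SPEECH:
            break
    speech_tokens = [t - SPEECH_BASE for t in sequence if t >= SPEECH_BASE]
    return speech_tokens  # a codec decoder would turn these into audio

print(len(synthesize("scaling compute improves naturalness")))
```

Because the whole pipeline is one next-token predictor over a single token stream, the same scaling recipes used for text LLMs (bigger models, more data, more sampling at inference) carry over directly.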