Llasa: Scaling Train-Time and Inference-Time Compute for

Llama-based Speech Synthesis

Abstract. Recent advances in text-based large language models (LLMs), particularly in the GPT series and the o1 model, have demonstrated the effectiveness of scaling both train-time and inference-time compute. However, current state-of-the-art TTS systems built on LLMs are often multi-stage, requiring separate models (e.g., a diffusion model after the LLM), which complicates the decision of whether to scale a particular model during training or testing. This work makes the following contributions. First, we explore the scaling of train-time and inference-time compute for speech synthesis. Second, we propose Llasa, a simple framework for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as LLaMA. Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns. Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy. In addition, we publicly release the checkpoints and training code for our TTS models (1B, 3B, 8B) and our codec model.
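As a rough illustration of this single-Transformer formulation, the sketch below treats speech codec tokens as ordinary vocabulary entries and generates them autoregressively after a text prefix. The checkpoint name and sampling settings are assumptions for illustration, not the released configuration.

```python
# Minimal sketch of the single-Transformer TTS loop: one LLaMA-style model
# predicts speech codec tokens after the text prefix. MODEL_ID and the
# sampling settings are assumptions, not the released configuration.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "HKUSTAudio/Llasa-1B"  # assumed Hugging Face repo name

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.bfloat16)

def synthesize(text: str, max_new_tokens: int = 2048) -> torch.Tensor:
    """Generate speech codec token ids for `text`; a waveform is obtained by
    decoding them with the single-layer VQ codec's decoder (not shown)."""
    inputs = tokenizer(text, return_tensors="pt")
    ids = model.generate(**inputs, do_sample=True, temperature=0.8,
                         top_p=0.95, max_new_tokens=max_new_tokens)
    return ids[0, inputs["input_ids"].shape[1]:]  # speech tokens only
```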


Comparison of Inference-Time Scaling Results Using a Different Evaluation Metric

The left figure uses a different speaker embedding model, speechbrain/spkrec-ecapa-voxceleb, as the reference evaluation metric for speaker similarity. The right figure is the original Fig. 2.
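For reference, a minimal sketch of how such a speaker-similarity verifier can drive best-of-N selection at inference time, using the speechbrain/spkrec-ecapa-voxceleb embedder named above. It assumes 16 kHz mono waveforms as torch tensors and the import path of speechbrain >= 1.0; synthesize() from the earlier sketch would produce the candidates.

```python
# Best-of-N selection with a speaker-similarity verifier: keep the candidate
# whose ECAPA speaker embedding is closest to the prompt's.
import torch
import torch.nn.functional as F
from speechbrain.inference.speaker import EncoderClassifier

verifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb")

def embed(wav: torch.Tensor) -> torch.Tensor:
    """wav: 1-D 16 kHz waveform -> speaker embedding vector."""
    return verifier.encode_batch(wav.unsqueeze(0)).squeeze()

def best_of_n(prompt_wav: torch.Tensor,
              candidates: list[torch.Tensor]) -> torch.Tensor:
    ref = embed(prompt_wav)
    scores = [F.cosine_similarity(ref, embed(c), dim=0) for c in candidates]
    return candidates[int(torch.stack(scores).argmax())]
```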

Comparison Results on the RAVDESS Benchmark

RAVDESS has only two texts: "Dogs are sitting by the door." is used as the prompt text, and "Kids are talking by the door." as the synthesis text. The results below for NaturalSpeech 3, NaturalSpeech 2, Voicebox (R), VALL-E (R), Mega-TTS 2, StyleTTS 2, and HierSpeech++ are taken from the official NaturalSpeech 3 demo page; (R) indicates systems reproduced by NaturalSpeech 3.
Prompt Emotion | Prompt | Ground Truth | Llasa-1b-250k | Llasa-3b-250k | Llasa-8b-250k | FireRedTTS | F5-TTS | MaskGCT | E2-TTS | CosyVoice2 | CosyVoice | NaturalSpeech 3 | NaturalSpeech 2 | Voicebox (R) | VALL-E (R) | Mega-TTS 2 | StyleTTS 2 | HierSpeech++
neutral
happy
calm
sad
angry
fearful
disgust
surprised
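One way to apply inference-time scaling to an emotion benchmark like this is to use a speech-emotion classifier as the verifier and keep the candidate it scores highest for the target emotion. A sketch with a placeholder model id; the paper's actual verifier may differ.

```python
# Emotion verifier for best-of-N: rank candidates by the classifier's score
# for the target emotion label. The model id below is a placeholder, not the
# verifier used in the paper.
from transformers import pipeline

ser = pipeline("audio-classification",
               model="your-org/speech-emotion-recognition")  # hypothetical id

def pick_most_emotional(wav_paths: list[str], target: str) -> str:
    def score(path: str) -> float:
        preds = ser(path)  # list of {"label": ..., "score": ...}
        # fall back to 0.0 if the target label is not among the returned ones
        return next((p["score"] for p in preds if p["label"] == target), 0.0)
    return max(wav_paths, key=score)
```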

Scaling Train-Time Compute

We randomly selected two samples from the English test set. All synthesized audio was generated solely from the input text (without any speech prompt), and each model was sampled three times per text, to specifically evaluate its text comprehension ability. The table below presents results across model sizes and amounts of training data.
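Concretely, the three "Random Sample" rows per model come from decoding the same text three times with stochastic sampling, as in the sketch below (reusing the hypothetical synthesize() helper from above; its temperature setting is an assumption, not the paper's value).

```python
# Three independent stochastic decodes of the same sentence; each call to
# synthesize() samples a new speech-token sequence, hence a new prosody.
text = "Uh, are you sure about this?"  # any test-set sentence
random_samples = [synthesize(text) for _ in range(3)]
```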

Sample | Llasa-1b-80k | Llasa-1b-160k | Llasa-1b-250k | Llasa-3b-250k | Llasa-8b-250k
"Uh, are you sure about this?" Tim asked nervously, looking at the steep slope before them. "Whoa, it’s higher than I thought," he continued, his voice filled with trepidation. "Aha, but look at the view," Emily responded with excitement, "it’s worth the climb!"
Random Sample 1
Random Sample 2
Random Sample 3
Her hands shaking with excitement, Alice Monroe stuttered, "oh..I-I can’t believe it! Is this really my acceptance letter to Harvard?" Marco cannot believe it either: "God damn it! How did you pull this off?"
Random Sample 1
Random Sample 2
Random Sample 3

Two samples were randomly selected from the Chinese test set.

Sample | Llasa-1b-80k | Llasa-1b-160k | Llasa-1b-250k | Llasa-3b-250k | Llasa-8b-250k
帘外雨潺潺,春意阑珊。罗衾不耐五更寒。梦里不知身是客,一晌贪欢。独自莫凭栏,无限江山。别时容易见时难。流水落花春去也,天上人间。(a classical Chinese lyric poem by Li Yu)
Random Sample 1
Random Sample 2
Random Sample 3
人要是行,干一行行一行,一行行行行行,行行行干哪行都行;要是不行,干一行不行一行,一行不行行行不行,行行不行,干哪行都不行。(a Chinese tongue twister built on the polyphonic character 行, read as both xíng "to be capable" and háng "profession")
Random Sample 1
Random Sample 2
Random Sample 3

Scaling Inference-Time Compute

Using the Llasa-1b-250k model, we compared the results of direct inference and inference-time scaling. The two examples shown below were randomly selected from the seed-tts-eval test-hard set.
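Since test-hard stresses content accuracy (tongue twisters and heavy repetition), a natural verifier for inference-time scaling here is an ASR model: transcribe each candidate and keep the one with the lowest character error rate against the target text. A sketch, assuming Whisper as the ASR verifier; the paper's verifier may differ.

```python
# Content-accuracy verifier: pick the candidate whose ASR transcript has the
# lowest character error rate (CER) against the target text.
import jiwer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def best_by_cer(target_text: str, wav_paths: list[str]) -> str:
    def cer(path: str) -> float:
        hyp = asr(path)["text"]
        return jiwer.cer(target_text, hyp)
    return min(wav_paths, key=cer)
```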

Target Text | Prompt | Direct Inference | Scaling Inference-Time Compute
喇嘛与哑巴:打南边来了个哑巴,腰里别了个喇叭;打北边来了个喇嘛,手里提了个獭犸。提着獭犸的喇嘛要拿獭犸换别着喇叭的哑巴的喇叭;别着喇叭的哑巴不愿拿喇叭换提着獭犸的喇嘛的獭犸。不知是别着喇叭的哑巴打了提着獭犸的喇嘛一喇叭,还是提着獭犸的喇嘛打了别着喇叭的哑巴一獭犸。喇嘛回家炖獭犸;哑巴嘀嘀哒哒吹喇叭。(the classic "lama and mute" Chinese tongue twister)
高高山上一座庙,住了八个出家人,八个道人都有名:大弟子,叫凳大,二弟子,叫大凳,三弟子,叫猴三,四弟子,叫三猴,五弟子,叫瓶茶,六弟子,叫茶瓶,七弟子,叫冰别边,八弟子,叫边别冰。凳大会打鼓,大凳会撞钟,猴三会烧火,三猴会点灯;瓶茶会吹管,茶瓶会吹笙;冰别边会煮饭,边别冰会念经。大凳要打凳大鼓,凳大要撞大凳钟;三猴要烧猴三火,猴三要点三猴灯;茶瓶要吹瓶茶管,瓶茶要吹茶瓶笙;边别冰要煮冰别边的饭,冰别边要念边别冰的经。大凳打不好凳大的鼓,凳大撞不好大凳的钟;三猴烧不好猴三的火,猴三点不好三猴的灯;茶瓶吹不好瓶茶的管,瓶茶吹不好茶瓶的笙;边别冰煮不好冰别边的饭,冰别边念不好边别冰的经。凳大还打凳大鼓,大凳还撞大凳钟;猴三还烧猴三火,三猴还点三猴灯;瓶茶还吹瓶茶管,茶瓶还吹茶瓶笙;冰别边还煮冰别边的饭,边别冰还念边别冰的经。各人还干各一行,白白争个脸红脖子青。(a Chinese tongue twister about eight monks in a mountain temple whose names are mirrored pairs)
We use Llasa-1b-250k for a continuation experiment on the LibriSpeech test-clean dataset. The generated audio for each sample begins with the first 3 seconds of the ground-truth audio, followed by the model's generated continuation.
Ground Truth | Direct Inference | Scaling Inference-Time Compute
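A sketch of this continuation setup: the first 3 seconds of ground-truth audio are encoded to codec tokens and forced as a prefix, after which the model generates freely. The codec.encode/decode names, and the omitted offset that maps codec indices into the LLM vocabulary, are assumptions.

```python
# Continuation sketch: force the first 3 s of ground truth as a token prefix.
# `codec` stands in for the single-layer VQ codec (method names assumed).
import torch

SAMPLE_RATE = 16_000
PREFIX_SECONDS = 3

def continue_utterance(text, gt_wav, model, tokenizer, codec):
    prefix_wav = gt_wav[: PREFIX_SECONDS * SAMPLE_RATE]
    prefix_ids = codec.encode(prefix_wav).unsqueeze(0)   # [1, n] speech tokens
    text_ids = tokenizer(text, return_tensors="pt").input_ids
    ids = torch.cat([text_ids, prefix_ids], dim=1)       # text + forced prefix
    out = model.generate(ids, do_sample=True, max_new_tokens=2048)
    return codec.decode(out[0, text_ids.shape[1]:])      # prefix + continuation
```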

Codec Reconstruction Samples

Sample | GT | Xcodec2 | StableCodec | WavTokenizer_40tps | WavTokenizer_75tps | Xcodec_nq1 | Xcodec_nq2 | BigCodec | DAC_16k_nq1 | DAC_16k_nq2 | DAC_16k_nq12 | Encodec_nq2 | Encodec_nq8 | Mimi_nq4 | Mimi_nq6 | Mimi_nq8 | SemanticCodec | SpeechTokenizer_nq1 | SpeechTokenizer_nq2
Sample 1
Sample 2
Sample 3
Sample 4
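To turn reconstructions like these into numbers, one common choice is wideband PESQ between the ground-truth and decoded waveforms (STOI, speaker similarity, or UTMOS slot in the same way). A sketch using the pesq PyPI package at 16 kHz; the paper's exact evaluation protocol may differ.

```python
# Score one codec reconstruction against its ground truth with wideband PESQ.
import torchaudio
from pesq import pesq

def pesq_score(gt_path: str, recon_path: str) -> float:
    gt, sr_gt = torchaudio.load(gt_path)
    recon, sr_rc = torchaudio.load(recon_path)
    assert sr_gt == sr_rc == 16_000, "PESQ 'wb' mode expects 16 kHz audio"
    n = min(gt.shape[1], recon.shape[1])  # align lengths before scoring
    return pesq(16_000, gt[0, :n].numpy(), recon[0, :n].numpy(), "wb")
```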