Llasa: Scaling Train-Time and Inference-Time Compute for
Llama-based Speech Synthesis
Abstract. Recent advances in text-based large language models (LLMs), particularly the GPT series and the o1 model, have demonstrated the effectiveness of scaling both train-time and inference-time compute. However, current state-of-the-art text-to-speech (TTS) systems leveraging LLMs are often multi-stage, requiring separate models (e.g., a diffusion model after the LLM), which complicates the decision of whether to scale a particular model during training or testing. This work makes the following contributions: First,
we explore the scaling of train-time and inference-time compute for speech synthesis.
Second, we propose Llasa, a simple framework for speech synthesis that employs a single-layer vector quantizer (VQ) codec and a single Transformer architecture to fully align with standard LLMs such as LLaMA.
Our experiments reveal that scaling train-time compute for Llasa consistently improves the naturalness of synthesized speech and enables the generation of more complex and accurate prosody patterns.
Furthermore, from the perspective of scaling inference-time compute, we employ speech understanding models as verifiers during the search, finding that scaling inference-time compute shifts the sampling modes toward the preferences of specific verifiers, thereby improving emotional expressiveness, timbre consistency, and content accuracy.
In addition, we publicly release the checkpoints and training code for our TTS models (1B, 3B, and 8B) and our codec model.
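To make the single-stage design concrete, the sketch below shows how TTS reduces to plain next-token prediction once text tokens and speech (codec) tokens share one LLaMA-style Transformer. The repository id and the special-token markers are illustrative assumptions, not the exact released identifiers.

```python
# Minimal sketch of Llasa-style single-stage TTS as next-token prediction.
# NOTE: the repo id and the <|TEXT_*|>/<|SPEECH_*|> markers below are assumed
# for illustration; consult the released code for the real identifiers.
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("HKUSTAudio/Llasa-1B")       # assumed repo id
lm = AutoModelForCausalLM.from_pretrained("HKUSTAudio/Llasa-1B")

# Text goes in as ordinary tokens; the model then emits speech-codec tokens.
prompt = "<|TEXT_START|>Hello world.<|TEXT_END|><|SPEECH_START|>"  # assumed markers
ids = tok(prompt, return_tensors="pt").input_ids
out = lm.generate(ids, max_new_tokens=512, do_sample=True, top_p=0.95)

# The newly generated ids are single-layer VQ codes; the codec's decoder
# would turn them back into a waveform.
speech_codes = out[0, ids.shape[1]:]
```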
Comparison of Inference-Time Scaling Results Using a Different Evaluation Metric
The left figure uses a different speaker embedding model, speechbrain/spkrec-ecapa-voxceleb, as the evaluation metric for speaker similarity. The right figure is the original Fig. 2.
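For reference, here is a minimal sketch of how speaker similarity can be computed with the speechbrain/spkrec-ecapa-voxceleb embedding model; the file paths are placeholders and the exact evaluation script may differ.

```python
# Speaker similarity (SIM): cosine similarity between ECAPA-TDNN embeddings
# of the reference (prompt) utterance and the synthesized utterance.
import torch
import torchaudio
from speechbrain.pretrained import EncoderClassifier

classifier = EncoderClassifier.from_hparams(
    source="speechbrain/spkrec-ecapa-voxceleb",
    savedir="pretrained_models/spkrec-ecapa-voxceleb",
)

def speaker_similarity(ref_path: str, syn_path: str) -> float:
    embeddings = []
    for path in (ref_path, syn_path):
        wav, sr = torchaudio.load(path)
        # ECAPA-TDNN expects 16 kHz mono input.
        wav = torchaudio.functional.resample(wav.mean(dim=0, keepdim=True), sr, 16000)
        embeddings.append(classifier.encode_batch(wav).squeeze())
    return torch.nn.functional.cosine_similarity(embeddings[0], embeddings[1], dim=0).item()

print(speaker_similarity("prompt.wav", "synthesized.wav"))  # placeholder paths
```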
Comparison Results on the RAVDESS Benchmark
RAVDESS has only two texts: "Dogs are sitting by the door." as the prompt text and "Kids are talking by the door." as the synthesis text. The following results for NaturalSpeech 3, NaturalSpeech 2, Voicebox (R), VALL-E (R), Mega-TTS 2, StyleTTS 2, and HierSpeech++ are taken from the official NaturalSpeech 3 demo page; (R) indicates systems reproduced by the NaturalSpeech 3 authors.
[Audio comparison table omitted. Rows: one per prompt emotion (neutral, happy, calm, sad, angry, fearful, disgust, surprised). Columns: Prompt, Ground Truth, Llasa-1b-250k, Llasa-3b-250k, Llasa-8b-250k, FireRedTTS, F5-TTS, MaskGCT, E2-TTS, CosyVoice2, CosyVoice, NaturalSpeech 3, NaturalSpeech 2, Voicebox (R), VALL-E (R), Mega-TTS 2, StyleTTS 2, HierSpeech++.]
Scaling Train-Time Compute
We randomly selected two samples from the English test set. All synthesized audio was generated solely from the input text (without any speech prompt), and each model was sampled three times to specifically evaluate its text comprehension. The table below presents results across models of various sizes and amounts of training data.
Sample 1: "Uh, are you sure about this?" Tim asked nervously, looking at the steep slope before them. "Whoa, it’s higher than I thought," he continued, his voice filled with trepidation. "Aha, but look at the view," Emily responded with excitement, "it’s worth the climb!"

Sample 2: Her hands shaking with excitement, Alice Monroe stuttered, "oh..I-I can’t believe it! Is this really my acceptance letter to Harvard?" Marco cannot believe it either: "God damn it! How did you pull this off?"

[Audio table omitted. For each sample, three random generations (Random Sample 1-3) from each of Llasa-1b-80k, Llasa-1b-160k, Llasa-1b-250k, Llasa-3b-250k, and Llasa-8b-250k.]
Two samples were randomly selected from the Chinese test set.
Using the Llasa-1b-250k model, we compared direct inference with inference-time scaling. The two examples shown below were randomly selected from the test-hard subset of seed-tts-eval.
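As a rough illustration of the inference-time scaling procedure, the sketch below uses best-of-N sampling with an ASR-based verifier that scores content accuracy (speaker- and emotion-understanding verifiers can be substituted in the same way). The `synthesize` callable is a hypothetical wrapper around Llasa-1b-250k, not part of the released code.

```python
# Best-of-N inference-time scaling: draw N candidates, keep the one the
# verifier prefers. Here the verifier is Whisper-based WER (lower is better).
from typing import Callable, List
from jiwer import wer
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="openai/whisper-small")

def best_of_n(text: str, synthesize: Callable[[str], str], n: int = 16) -> str:
    """Return the path of the candidate whose transcript best matches `text`."""
    candidates: List[str] = [synthesize(text) for _ in range(n)]  # N i.i.d. samples
    scores = [wer(text.lower(), asr(path)["text"].lower()) for path in candidates]
    return candidates[min(range(n), key=scores.__getitem__)]
```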
We use Llasa-1b-250k for the continuation experiment on the LibriSpeech test-clean dataset. The generated audio for each sample starts with the first 3 seconds of the ground-truth audio, followed by the model's generated continuation.
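A minimal sketch of how each continuation sample is assembled, assuming a hypothetical `continue_speech` wrapper around Llasa-1b-250k (the real interface may differ):

```python
# Keep the first 3 seconds of the ground truth as an acoustic prompt,
# generate the rest, and prepend the prompt to the model's continuation.
from typing import Callable
import torch
import torchaudio

def make_continuation(
    gt_path: str,
    transcript: str,
    continue_speech: Callable[[torch.Tensor, int, str], torch.Tensor],
) -> torch.Tensor:
    wav, sr = torchaudio.load(gt_path)
    prompt = wav[:, : 3 * sr]                            # first 3 seconds of ground truth
    generated = continue_speech(prompt, sr, transcript)  # model's continuation
    return torch.cat([prompt, generated], dim=1)         # final audio: prompt + continuation
```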