Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis

Abstract. The success of modern AI demonstrates that the most effective approaches are those that scale seamlessly with increased computational resources, rather than relying heavily on human-engineered, domain-specific knowledge. In speech synthesis, prior systems often exploit inductive biases to manually disentangle speech into separate factors and adopt multi-stage pipelines that model each component with a dedicated module (e.g., LLMs for semantics and diffusion models for acoustics). Moreover, such pipelines inherently complicate decisions about how and where to scale computational resources. This raises a natural question: can speech modeling benefit from the same compute-driven scaling laws without relying on hand-crafted modular designs? To investigate, we introduce Llasa, a unified TTS framework that models all aspects of speech within a single autoregressive Transformer. At its core is X-Codec2, a single-layer vector quantizer that encodes all speech information (content, prosody, timbre) into a single discrete token stream. This minimalist, end-to-end design eliminates the need for hand-engineered modules and naturally aligns with compute scaling. Our experiments reveal that scaling train-time compute for Llasa consistently improves its text understanding and in-context learning abilities. On the inference side, we employ speech understanding models as verifiers during search and find that increasing inference-time compute yields consistent improvements in the quality of synthesized speech. Finally, we publicly release the checkpoints and training code for both our TTS model and our codec.
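To make the single-stream design concrete, the following minimal sketch shows the whole pipeline as the abstract describes it: text tokens and speech tokens live in one vocabulary, one autoregressive Transformer generates the speech tokens, and the codec decodes them straight to a waveform. All names here (lm, text_tokenizer, speech_codec) are illustrative placeholders, not the released API.

```python
# Minimal sketch of Llasa's single-stream TTS design, with placeholder
# components (this is NOT the released API): one autoregressive
# Transformer handles text and speech tokens in a single stream.
import torch

def synthesize(text: str, lm, text_tokenizer, speech_codec,
               max_new_tokens: int = 2048) -> torch.Tensor:
    """Text in, waveform out, with no hand-engineered intermediate stages."""
    # 1. Tokenize the input text with the LM's ordinary text tokenizer.
    text_ids = text_tokenizer.encode(text, return_tensors="pt")

    # 2. Sample speech tokens autoregressively from the same Transformer
    #    that consumed the text tokens (one vocabulary, one model).
    speech_ids = lm.generate(text_ids, do_sample=True, top_p=0.95,
                             max_new_tokens=max_new_tokens)

    # 3. The single-layer VQ codec (X-Codec2 in the paper) maps the discrete
    #    token stream directly back to a waveform; no separate acoustic model.
    return speech_codec.decode(speech_ids)
```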


Codec Reconstruction Samples

Reconstruction audio for Samples 1–12, comparing the ground truth (GT) against Xcodec2, StableCodec, WavTokenizer (40 tps and 75 tps), Xcodec (nq = 1, 2), BigCodec, DAC-16k (nq = 1, 2, 12), Encodec (nq = 2, 8), Mimi (nq = 4, 6, 8), SemanticCodec, and SpeechTokenizer (nq = 1, 2).
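The comparison above is a plain encode–decode round trip: each codec maps the ground-truth waveform to discrete tokens and then reconstructs audio from those tokens alone. The sketch below illustrates this protocol under an assumed generic single-codebook interface; the codec object and its encode/decode methods are placeholders, not the released xcodec2 API.

```python
# Sketch of the reconstruction round trip behind the comparison above.
# `codec` is a placeholder with an assumed encode/decode interface.
import torchaudio

def reconstruct(codec, wav_path: str, out_path: str, sr: int = 16000):
    wav, orig_sr = torchaudio.load(wav_path)                # (channels, samples)
    wav = torchaudio.functional.resample(wav, orig_sr, sr)  # match codec rate

    codes = codec.encode(wav)      # a single token stream per frame (nq = 1)
    recon = codec.decode(codes)    # waveform rebuilt from the tokens alone

    torchaudio.save(out_path, recon, sr)
    # Rough efficiency proxy: tokens emitted per second of input audio.
    return codes.numel() / (wav.shape[-1] / sr)
```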

Scaling Train-Time Compute

We randomly selected two samples from the English test set. All synthesized audio was generated solely from the input text (without any speech prompt), and each model was sampled three times to specifically evaluate its text comprehension. The tables below present results across models of various sizes and amounts of training data (a code sketch of this sampling protocol follows the tables).

Sample Llasa-1b-80k Llasa-1b-160k Llasa-1b-250k Llasa-3b-250k Llasa-8b-250k (model size / hours of training speech)
"Uh, are you sure about this?" Tim asked nervously, looking at the steep slope before them. "Whoa, it’s higher than I thought," he continued, his voice filled with trepidation. "Aha, but look at the view," Emily responded with excitement, "it’s worth the climb!"
Random Sample 1
Random Sample 2
Random Sample 3
Her hands shaking with excitement, Alice Monroe stuttered, "oh..I-I can’t believe it! Is this really my acceptance letter to Harvard?" Marco cannot believe it either: "God damn it! How did you pull this off?"
Random Sample 1
Random Sample 2
Random Sample 3

Two samples randomly selected from the Chinese test set.

Sample Llasa-1b-80k Llasa-1b-160k Llasa-1b-250k Llasa-3b-250k Llasa-8b-250k
帘外雨潺潺,春意阑珊。罗衾不耐五更寒。梦里不知身是客,一晌贪欢。独自莫凭栏,无限江山。别时容易见时难。流水落花春去也,天上人间。 (Li Yu, to the tune "Lang Tao Sha": Outside the curtain the rain patters on, and spring is fading. My silk quilt cannot keep out the chill of the fifth watch. In dreams I forgot I was a guest and stole a moment's pleasure. Do not lean on the railing alone before the boundless rivers and hills: parting was easy, meeting again is hard. Flowing water, fallen blossoms, spring is gone, in heaven as on earth.)
Random Sample 1
Random Sample 2
Random Sample 3
人要是行,干一行行一行,一行行行行行,行行行干哪行都行,要是不行,干一行不行一行,一行不行行行不行,行行不行,干哪行都不行。 (A tongue twister built on the polyphonic character 行, read xíng "capable" or háng "line of work": if you are capable, you will succeed in whatever line you take up, one success leads to another, and any line will do; if you are not, you will fail at whatever line you take up, one failure leads to another, and no line will do.)
Random Sample 1
Random Sample 2
Random Sample 3
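The evaluation protocol above is simple to state in code: every checkpoint synthesizes the same text three times with stochastic sampling and no speech prompt, so variation reflects text understanding rather than voice cloning. The sketch below reuses the placeholder synthesize pipeline sketched after the abstract; the load_model helper is likewise hypothetical.

```python
# Sketch of the sampling grid behind the tables above. `load_model` is a
# hypothetical loader returning the placeholder (lm, tokenizer, codec)
# triple used by `synthesize`; checkpoint names mirror the table columns.
CHECKPOINTS = ["Llasa-1b-80k", "Llasa-1b-160k", "Llasa-1b-250k",
               "Llasa-3b-250k", "Llasa-8b-250k"]

def sample_grid(text: str, load_model, n_samples: int = 3):
    rows = {}
    for name in CHECKPOINTS:
        lm, tok, codec = load_model(name)
        # Three independent random draws per model, text-only conditioning.
        rows[name] = [synthesize(text, lm, tok, codec)
                      for _ in range(n_samples)]
    return rows
```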

Scaling Inference-Time Compute

Samples for Figure 3, using ensemble verifiers with the Llasa-1b-250k model under different inference-time compute settings. The column labels 2⁰ through 2⁶ give the number of sampled candidates (a sketch of the search procedure follows these samples).
The Liang family’s sheep knock down the Jiang family’s wall; the wall hurts the sheep. Liang wants Jiang to pay for the sheep, Jiang wants Liang to rebuild the wall.
2⁰ 2¹ 2² 2³ 2⁴ 2⁵ 2⁶
In the southeast corner stands a cold temple; a straw hat hangs on the beam. Leave the temple—wear the hat. Enter again—take it off.
2⁰ 2¹ 2² 2³ 2⁴ 2⁵ 2⁶
Deep into that darkness peering, long I stood there, wondering, fearing, doubting, dreaming dreams no mortal ever dared to dream before; but the silence was unbroken, and the stillness gave no token, and the only word there spoken was the whispered word, "Lenore!" This I whispered, and an echo murmured back the word, "Lenore!" Merely this, and nothing more.
2⁰ 2¹ 2² 2³ 2⁴ 2⁵ 2⁶
The detective’s voice, full of determination and fire, was heard loud and clear in the room, "No one will tell me what I can or cannot do. I’ll prove them all wrong! Get me my gun. What are you all looking at me for?"
2⁰ 2¹ 2² 2³ 2⁴ 2⁵ 2⁶
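The compute axis in these columns corresponds to best-of-N search: sample N = 2^k candidates (k = 0 to 6) and keep the one preferred by the verifier ensemble. The sketch below shows this procedure under the assumption of simple score averaging; the paper's exact verifiers and aggregation rule are not reproduced here.

```python
# Best-of-N search with an ensemble of speech-understanding verifiers.
# Each verifier is assumed to map (text, waveform) to a comparable score;
# the equal-weight average below is an illustrative choice, not the
# paper's exact aggregation.
def best_of_n(text, synthesize_once, verifiers, n_candidates: int):
    candidates = [synthesize_once(text) for _ in range(n_candidates)]
    scores = [sum(v(text, wav) for v in verifiers) / len(verifiers)
              for wav in candidates]
    best = max(range(n_candidates), key=scores.__getitem__)
    return candidates[best]

# The inference-time compute settings shown above: 1, 2, 4, ..., 64 candidates.
BUDGETS = [2 ** k for k in range(7)]
```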
Samples for Figure 4. Using the Llasa-1b-250k model, we compare direct inference with scaled inference-time compute. The two examples below were randomly selected from seed-tts-eval test-hard.

Target Text Prompt Direct Inference Scaling Inference-Time Compute
喇嘛与哑巴 打南边来了个哑巴,腰里别了个喇叭; 打北边来了个喇嘛,手里提了个獭犸. 提着獭犸的喇嘛要拿獭犸换别着喇叭的哑巴的喇叭; 别着喇叭的哑巴不愿拿喇叭换提着獭犸的喇嘛的獭犸. 不知是别着喇叭的哑巴打了提着獭犸的喇嘛一喇叭; 还是提着獭犸的喇嘛打了别着喇叭的哑巴一獭犸. 喇嘛回家炖獭犸; 哑巴嘀嘀哒哒吹喇叭 (The Lama and the Mute, a classic Mandarin tongue twister: from the south comes a mute with a trumpet tucked in his belt; from the north comes a lama with a tama in his hand. The lama wants to trade his tama for the mute's trumpet; the mute refuses. No one knows whether the mute hit the lama with the trumpet, or the lama hit the mute with the tama. The lama goes home to stew his tama; the mute toots away on his trumpet.)
高高山上一座庙,住了八个出家人,八个道人都有名:大弟子,叫凳大,二弟子,叫大凳,三弟子,叫猴三,四弟子,叫三猴,五弟子,叫瓶茶,六弟子,叫茶瓶,七弟子,叫冰别边,八弟子,叫边别冰。凳大会打鼓,大凳会撞钟,猴三会烧火,三猴会点灯;瓶茶会吹管,茶瓶会吹笙;冰别边会煮饭,边别冰会念经。大凳要打凳大鼓,凳大要撞大凳钟;三猴要烧猴三火,猴三要点三猴灯;茶瓶要吹瓶茶管,瓶茶要吹茶瓶笙;边别冰要煮冰别边的饭,冰别边要念边别冰的经。大凳打不好凳大的鼓,凳大撞不好大凳的钟;三猴烧不好猴三的火,猴三点不好三猴的灯;茶瓶吹不好瓶茶的管,瓶茶吹不好茶瓶的笙;边别冰煮不好冰别边的饭,冰别边念不好边别冰的经。凳大还打凳大鼓,大凳还撞大凳钟;猴三还烧猴三火,三猴还点三猴灯;瓶茶还吹瓶茶管,茶瓶还吹茶瓶笙;冰别边还煮冰别边的饭,边别冰还念边别冰的经。各人还干各一行,白白争个脸红脖子青。 (A tongue twister about eight monks in a mountain temple whose names come in mirrored pairs, such as Dengda/Dadeng and Housan/Sanhou; each has his own chore, each insists on doing another's chore and fails at it, and in the end everyone returns to his own job, having argued themselves red in the face for nothing.)

Zero-shot In-context Learning using Llasa-1b-250k with direct inference (prompts from the Seed-TTS demo page)

Language Prompt Same-Language Generation Cross-lingual Generation
EN
I don't really care what you call me. I've been a silent spectator, watching species evolve, empires rise and fall. But always remember, I am mighty and enduring. Respect me and I'll nurture you; ignore me and you shall face the consequences.

顿时,气氛变得沉郁起来。乍看之下,一切的困扰仿佛都围绕在我身边。我皱着眉头,感受着那份压力,但我知道我不能放弃,不能认输。于是,我深吸一口气,心底的声音告诉我:“无论如何,都要冷静下来,重新开始。” (Translation: Suddenly, the atmosphere turned gloomy. At first glance, all my troubles seemed to gather around me. I frowned under the pressure, but I knew I could not give up or admit defeat. So I took a deep breath, and the voice in my heart told me: "No matter what, calm down and start again.")

Dealing with family secrets is never easy. Yet, sometimes, omission is a form of protection, intending to safeguard some from the harsh truths. One day, I hope you understand the reasons behind my actions. Until then, Anna, please, bear with me.

处理家庭秘密从来都不是一件容易的事。然而,有时候,隐瞒是一种保护形式,旨在保护一些人免受残酷的真相伤害。有一天,我希望你能理解我行为背后的原因。在那之前,安娜,请容忍我。 (Chinese rendering of the preceding English passage.)

The combinations of different textures and flavors create a perfect harmony. The succulence of the steak, the tartness of the cranberries, the crunch of pine nuts, and creaminess of blue cheese make it a truly delectable delight. Enjoy your culinary adventure!

听着你的话,我心里五味杂陈。虽然我愿意一直在你身边,承担一切不幸,但我知道只有让你自己面对,才能真正让你变得更强大。所以,你要记得,无论面对何种困难,都请你坚强,我会在心里一直支持你的。 (Translation: Hearing your words, I am filled with mixed feelings. Although I would gladly stay by your side and bear every misfortune, I know that only by facing things yourself can you truly grow stronger. So remember: whatever difficulties you face, please stay strong; I will always support you in my heart.)
ZH
突然,身边一阵笑声。我看着他们,意气风发地挺直了胸膛,甩了甩那稍显肉感的双臂,轻笑道:"我身上的肉,是为了掩饰我爆棚的魅力,否则,岂不吓坏了你们呢?"

Suddenly, there was a burst of laughter beside me. I looked at them, stood up straight with high spirit, shook the slightly fleshy arms, and smiled lightly, saying, "The flesh on my body is to hide my bursting charm. Otherwise, wouldn't it scare you?"

他闭上眼睛,期望这一切都能过去。然而,当他再次睁开眼睛,眼前的景象让他不禁倒吸一口气。雾气中出现的禁闭岛,陌生又熟悉,充满未知的危险。他握紧拳头,心知他的生活即将发生翻天覆地的改变。

He closed his eyes, expecting that all of this could pass. However, when he opened his eyes again, the sight in front of him made him couldn't help but take a deep breath. The closed island that appeared in the fog, strange and familiar, was full of unknown dangers. He tightened his fist, knowing that his life was about to undergo earth-shaking changes.

顿时,气氛变得沉郁起来。乍看之下,一切的困扰仿佛都围绕在我身边。我皱着眉头,感受着那份压力,但我知道我不能放弃,不能认输。于是,我深吸一口气,心底的声音告诉我:“无论如何,都要冷静下来,重新开始。”

Suddenly, the atmosphere became gloomy. At first glance, all the troubles seemed to surround me. I frowned, feeling that pressure, but I know I can't give up, can't admit defeat. So, I took a deep breath, and the voice in my heart told me, "Anyway, must calm down and start again."