HD-PPT: Hierarchical Decoding of Content- and Prompt-Preference Tokens
for instruction-based TTS

Anonymous submission to ICASSP 2026

Abstract

Large Language Model (LLM)-based Text-To-speech (TTS) models already reached a high degree of naturalness. However, the precision control of TTS inference is still challenging. Although instruction-based Text-to-Speech (Instruct-TTS) models are proposed, these models still lack fine-grained control due to the modality gap between single-level text instructions and multilevel speech tokens. To address this question, we propose HD-PPT, a framework that transforms speech synthesis into a structured, hierarchical task. To enable fine-grained control, we introduce a novel speech codec to extract distinct prompt-preference and content-preference tokens from the complex speech tokens, supervised by automatic speech recognition (ASR) and cross-lingual audio-text pre-training (CLAP) objectives. To bridge the modality gap of these tokens, we proposed hierarchical decoding strategy, where the LLM generates tokens in a structured order: first semantic, then fine-grained style, and finally complete acoustic representation. Extensive experiments demonstrate that this hierarchical paradigm significantly improves instruction adherence and achieves state-of-the-art naturalness, validating our approach for precise and controllable speech synthesis.

Model Architecture

Figure 1: Comparison between the previous and the proposed HD-PPT frameworks.

Figure 2: An overview of our proposed speech token codec, which extracts dual-preference tokens from speech tokens.

Figure 3: An overview of our proposed HD-PPT, comprising the hierarchical LLM and a subsequent vocoder.

Comparative Experiment

To demonstrate the effectiveness of our proposed approach, we conducted a comparative study of several popular retrieval methods on HD-PPT.
The methods we compared include:
(1) PromptStyle.
(2) PromptTTS.
(3) CosyVoice.
(4) CosyVoice2.
(5) EmoVoice-PP.

Sample 1

Prompt Text: With a composed demeanor, she spoke slowly.

Content Text: "Haven't they any children?" asked Constance indifferently.

Groundtruth	Proposed	PromptStyle	PromptTTS	CosyVoice	CosyVoice2	EmoVoice-PP

Sample 2

Prompt Text: Speaking at a regular pace, the mad male speaker employs a high-pitched voice with moderate energy.

Content Text: Over them swooped the eagles.

Groundtruth	Proposed	PromptStyle	PromptTTS	CosyVoice	CosyVoice2	EmoVoice-PP

Sample 3

Prompt Text: Emphasizing bliss that resonates in harmonic, cheerful tones.

Content Text: When the world sings in harmony like a well-tuned orchestra embracing the air.

Groundtruth	Proposed	PromptStyle	PromptTTS	CosyVoice	CosyVoice2	EmoVoice-PP

Sample 4

Prompt Text: Depicting the ache of isolation with suppressed, trembling sadness.

Content Text: Nobody looks up anymore; they just pass by like you're not even there.

Groundtruth	Proposed	PromptStyle	PromptTTS	CosyVoice	CosyVoice2	EmoVoice-PP

Sample 5

Prompt Text: Echoing with an involuntary awe and the shock of unexpected beauty.

Content Text: A flock of birds erupted from the tree, a sudden shadow against the sky.

Groundtruth	Proposed	PromptStyle	PromptTTS	CosyVoice	CosyVoice2	EmoVoice

Sample 6

Prompt Text: Imparting an anxious atmosphere with quivering undertones of dread.

Content Text: Shadows crawled across the walls, growing bolder as the night deepened.

Groundtruth	Proposed	PromptStyle	PromptTTS	CosyVoice	CosyVoice2	EmoVoice-PP

Ablation Study

To demonstrate the effectiveness of our proposed approach, we conducted an ablation study on HD-PPT.
We conducted tests on the following configurations:
(1) w/o Content-Pref.: removing prompt-preference tokens.
(2) w/o Prompt-Pref.: removing prompt-preference tokens.
(3) w/o Dual-Pref.: bypassing both preference tokens.
(4) w/o Instruct Text: generating speech without the style prompt.
(5) Parallel: all tokens are decoded simultaneously from the LLM's hidden states.
(6) Single-step: only speech tokens are decoded from the LLM's hidden states.