HPRO: Hierarchical Progressive Reward Optimization via
Preference Extraction for Emotional Text-to-Speech

Anonymous submission to Interspeech 2026

Code (GitHub)

Abstract

Recently, LLM-based Text-to-Speech (TTS) has achieved high naturalness. However, Supervised Fine-Tuning often converges to statistically averaged prosody, limiting emotional expressiveness. Preference-driven optimization offers a promising alternative, yet existing approaches suffer from two mismatches: information conflict, where content and emotion in a shared latent space produce conflicting gradients; and scale gap, where sparse sentence-level rewards struggle to guide dense frame-level generation. To address the information conflict, we introduce the HD-Emo codec as a differentiable reward model to extract distinct preference tokens, isolating emotional optimization from semantic sequences. To handle the scale gap, we propose HPRO, a hierarchical progressive reward optimization framework, employing the HD-Emo codec to progressively align frame-, word-, and sentence-level objectives. Experiments demonstrate that HPRO achieves high emotional expressiveness with robust stability.

Model Architecture

Motivation

Figure 1: Motivation. (a) DiffRO Framework. Single-scale reward optimization for monolithic speech representation.
(b) HPRO Framework. Hierarchical progressive reward optimization for structured preference spaces.

HD-Emo Codec

Figure 2: HD-Emo Codec. Monotonic speech tokens are processed by dual preference extractors with FSQ bottlenecks to obtain content and style preference tokens.
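
To make the FSQ bottleneck in the caption concrete, here is a minimal sketch of finite scalar quantization in plain Python. It is not the HD-Emo codec implementation: the function names `fsq_quantize` and `fsq_index`, the level counts, and the restriction to odd level counts are all assumptions made for illustration. Each latent dimension is squashed with `tanh`, rounded to a small integer grid, and the per-dimension codes are combined into one token index.

```python
import math

def fsq_quantize(z, levels):
    """Quantize each dimension of z to one of levels[i] integer codes (odd counts only)."""
    codes = []
    for value, num_levels in zip(z, levels):
        if num_levels % 2 == 0:
            raise ValueError("this sketch handles odd level counts only")
        half = (num_levels - 1) / 2
        bounded = half * math.tanh(value)   # squash into (-half, half)
        codes.append(int(round(bounded)))   # snap to the nearest integer grid point
    return codes

def fsq_index(codes, levels):
    """Combine per-dimension codes into a single token index (mixed-radix encoding)."""
    index = 0
    for code, num_levels in zip(codes, levels):
        half = (num_levels - 1) // 2
        index = index * num_levels + (code + half)  # shift code into [0, num_levels)
    return index

# Hypothetical 3-dimensional latent with 5 levels per dimension (125 possible tokens).
levels = [5, 5, 5]
codes = fsq_quantize([0.0, 10.0, -10.0], levels)
token = fsq_index(codes, levels)
```

In a trained model the rounding step is usually paired with a straight-through gradient estimator so the bottleneck stays trainable; that part is omitted here because it needs an autograd framework.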

HPRO Framework

Figure 3: HPRO framework. The LLM generates differentiable speech tokens via Gumbel-Softmax, which are mapped by HD-Emo codec into preference spaces. Hierarchical frame-, word-, and sentence-level rewards are progressively applied to update the LLM.
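
The Gumbel-Softmax step in the caption can be sketched as follows. This shows only the forward computation of the relaxation in plain Python, under assumed names (`gumbel_softmax`, `temperature`, `rng`) and an assumed three-token vocabulary. It is an illustration of the general technique, not the paper's training code.

```python
import math
import random

def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return a relaxed one-hot vector over the token vocabulary."""
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1);
        # U is clamped away from 0 and 1 so both logarithms stay finite.
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / temperature)
    peak = max(noisy)                            # subtract the max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits for a 3-token vocabulary.
rng = random.Random(0)
probs = gumbel_softmax([2.0, 0.5, -1.0], temperature=0.5, rng=rng)
```

Lower temperatures push the output toward a one-hot vector, while higher temperatures give smoother distributions. Because the output is a continuous function of the logits, a downstream reward model can, in principle, pass gradients back through it to the generator.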

Comparative Experiment

To demonstrate the effectiveness of the proposed approach, we compare HPRO with several popular TTS models.
The evaluated systems are:

TokenRecon
HPRO (Ours)
CosyVoice2
CosyVoice3
IndexTTS2
HD-PPT

Evaluation Metrics. For each generated audio sample, we report the objective metrics used in our paper:

wVAD-CCC: Word-level Valence-Arousal-Dominance Concordance Correlation Coefficient.
EMO-SIM: Emotion Similarity.
DNSMOS: Speech quality score predicted by DNSMOS P.835.

* The scores are shown under each audio sample in the format: [wVAD-CCC, EMO-SIM, DNSMOS].
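
For readers unfamiliar with the Concordance Correlation Coefficient behind wVAD-CCC, here is a minimal sketch of the standard CCC formula, 2·cov(x, y) / (var(x) + var(y) + (mean(x) − mean(y))²). The helper `mean_vad_ccc`, the dictionary keys, and the plain averaging over valence, arousal, and dominance are assumptions for illustration. The exact word-level aggregation is defined in the paper, not here.

```python
def ccc(pred, target):
    """Concordance Correlation Coefficient between two equal-length sequences."""
    if len(pred) != len(target) or len(pred) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    return 0.0 if denom == 0 else 2 * cov / denom

def mean_vad_ccc(pred_vad, target_vad):
    """Average CCC over the three VAD tracks (assumed aggregation, for illustration only)."""
    keys = ("valence", "arousal", "dominance")
    return sum(ccc(pred_vad[k], target_vad[k]) for k in keys) / len(keys)

# Hypothetical per-word values: a constant offset lowers CCC even though
# the two sequences are perfectly correlated.
shifted = ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])   # 4/7
```

Unlike Pearson correlation, CCC penalizes differences in mean and scale, so a system must match the reference contour in absolute terms to score well. Values range from -1 to 1.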

💡 Note on Semantic-Driven Emotion

In the following examples, the Truth and Prompt audio for some samples come from different speakers. This setup demonstrates HPRO's ability to generate high-intensity emotional prosody guided purely by textual semantics, indicating that the emotional expression stems from linguistic understanding of the text rather than acoustic imitation of the reference prompt.

For a direct comparison, we also provide TokenRecon (as described in our paper), which reconstructs the ground-truth speech tokens using the reference prompt and serves as the theoretical upper bound for semantic-driven emotional expressiveness.

Sample 1

🎭 Emotion: Happy

🗣️ Prompt Text: "Just because you think you're right doesn't mean you are."

📝 Content Text: "The morning dew glittered like tiny diamonds, signaling a day filled with promise."

Prompt	Truth

TokenRecon	HPRO	CosyVoice2	CosyVoice3	IndexTTS2	HD-PPT
[0.514, 1.00, 3.98]	[0.406, 0.72, 3.97]	[0.380, 0.95, 3.86]	[0.125, 0.03, 3.80]	[0.174, 0.00, 3.95]	[0.513, 0.20, 3.96]

Sample 2

🎭 Emotion: Happy

🗣️ Prompt Text: "The sight of the empty stage left us baffled. The performer was missing."

📝 Content Text: "Oh, wow. Did you see how cute the puppies are? Just heartwarming."

Prompt	Truth

TokenRecon	HPRO	CosyVoice2	CosyVoice3	IndexTTS2	HD-PPT
[0.769, 1.00, 4.13]	[0.403, 1.00, 3.95]	[0.332, 1.00, 3.74]	[0.418, 1.00, 3.57]	[0.340, 1.00, 3.57]	[0.327, 1.00, 4.01]

Sample 3

🎭 Emotion: Angry

🗣️ Prompt Text: "The old swing set creaked, echoing years of forgotten laughter and play."

📝 Content Text: "He slammed a door with a force that echoed his simmering rage."

Prompt	Truth

TokenRecon	HPRO	CosyVoice2	CosyVoice3	IndexTTS2	HD-PPT
[0.388, 0.09, 4.13]	[0.239, 1.00, 4.03]	[0.258, 0.00, 4.38]	[0.262, 0.00, 4.20]	[0.869, 1.00, 3.57]	[0.314, 1.00, 3.98]

Sample 4

🎭 Emotion: Angry

🗣️ Prompt Text: "Why are you yelling at me for something that wasn't even my idea? Why? Why?"

📝 Content Text: "How many times do we have to go over this?"

Prompt	Truth

TokenRecon	HPRO	CosyVoice2	CosyVoice3	IndexTTS2	HD-PPT
[0.603, 0.35, 3.57]	[0.211, 0.94, 3.97]	[0.005, 0.52, 3.84]	[-0.207, 0.35, 3.35]	[0.160, 0.42, 3.39]	[0.453, 0.35, 4.04]

Sample 5

🎭 Emotion: Surprised

🗣️ Prompt Text: "How can you be so consistently unreliable?"

📝 Content Text: "Wait, the train's arriving on time for once? That's unheard of!"

Prompt	Truth

TokenRecon	HPRO	CosyVoice2	CosyVoice3	IndexTTS2	HD-PPT
[0.444, 0.70, 4.10]	[0.561, 0.82, 4.02]	[0.153, 0.81, 3.93]	[0.042, 0.78, 4.20]	[0.333, 0.71, 3.42]	[0.087, 0.74, 4.07]

Sample 6

🎭 Emotion: Surprised

🗣️ Prompt Text: "Our fleet of pastel clouds sailed across the sky, painting expressions of tranquil bliss."

📝 Content Text: "Look, a double rainbow stretching across the entire sky!"

Prompt	Truth

TokenRecon	HPRO	CosyVoice2	CosyVoice3	IndexTTS2	HD-PPT
[0.456, 0.03, 3.98]	[0.266, 0.98, 3.89]	[0.351, 0.02, 3.90]	[0.364, 0.36, 3.83]	[0.113, 1.00, 3.53]	[0.444, 0.03, 4.09]

Sample 7

🎭 Emotion: Sad

🗣️ Prompt Text: "Why does it always feel like the world is ending when the sun sets like this?"

📝 Content Text: "She said nothing, but her silence screamed volumes."

Prompt	Truth

TokenRecon	HPRO	CosyVoice2	CosyVoice3	IndexTTS2	HD-PPT
[0.927, 1.00, 3.93]	[0.326, 0.00, 3.98]	[-0.260, 0.00, 3.90]	[0.155, 0.11, 4.08]	[0.248, 0.00, 3.53]	[0.422, 0.00, 3.89]

Sample 8

🎭 Emotion: Fearful

🗣️ Prompt Text: "Crashed underfoot, a reminder of what was, but is no longer."

📝 Content Text: "Is that sound supposed to be coming from the attic?"

Prompt	Truth

TokenRecon	HPRO	CosyVoice2	CosyVoice3	IndexTTS2	HD-PPT
[0.599, 1.00, 3.85]	[0.616, 0.00, 4.07]	[0.284, 0.99, 3.70]	[0.509, 0.99, 3.71]	[0.335, 0.28, 3.71]	[0.247, 1.00, 3.85]

Sample 9

🎭 Emotion: Neutral

🗣️ Prompt Text: "Um at that happened quick, he didn't know with time who knows."

📝 Content Text: "Because how could you say that to, you know, a child you know, but she didn't know any better, you know what I mean, she"

Prompt	Truth

TokenRecon	HPRO	CosyVoice2	CosyVoice3	IndexTTS2	HD-PPT
[0.669, 1.00, 3.84]	[0.452, 1.00, 3.66]	[0.461, 0.01, 3.61]	[0.437, 0.00, 4.06]	[0.043, 0.03, 3.53]	[0.506, 1.00, 3.95]
