HPRO: Hierarchical Progressive Reward Optimization via
Preference Extraction for Emotional Text-to-Speech

Anonymous submission to Interspeech 2026


Code (GitHub)

Abstract

Recently, LLM-based Text-to-Speech (TTS) has achieved high naturalness. However, Supervised Fine-Tuning often converges to statistically averaged prosody, limiting emotional expressiveness. Preference-driven optimization offers a promising alternative, yet existing approaches suffer from two mismatches: information conflict, where content and emotion in a shared latent space produce conflicting gradients; and scale gap, where sparse sentence-level rewards struggle to guide dense frame-level generation. To address the information conflict, we introduce the HD-Emo codec as a differentiable reward model to extract distinct preference tokens, isolating emotional optimization from semantic sequences. To handle the scale gap, we propose HPRO, a hierarchical progressive reward optimization framework, employing the HD-Emo codec to progressively align frame-, word-, and sentence-level objectives. Experiments demonstrate that HPRO achieves high emotional expressiveness with robust stability.


Model Architecture

Motivation

Figure 1: Motivation. (a) DiffRO Framework. Single-scale reward optimization for monolithic speech representation.
(b) HPRO Framework. Hierarchical progressive reward optimization for structured preference spaces.

HD-Emo Codec

Figure 2: HD-Emo Codec. Monolithic speech tokens are processed by dual preference extractors with FSQ bottlenecks to obtain content and style preference tokens.
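
As a rough illustration of the FSQ (finite scalar quantization) bottleneck named in the caption, the sketch below bounds each latent dimension, rounds it to a small number of levels, and packs the per-dimension codes into one token index. The level counts, the tanh bounding, and the names `fsq_quantize` and `codes_to_token` are assumptions made for illustration, not the HD-Emo configuration. In training, the rounding step is usually bypassed with a straight-through gradient, which this forward-only sketch omits.

```python
import math


def fsq_quantize(z, levels):
    """Map each latent value to an integer code in {0, ..., levels[i] - 1}.

    Hypothetical helper: `levels` is an illustrative choice, not the paper's.
    """
    codes = []
    for value, n_levels in zip(z, levels):
        half = (n_levels - 1) / 2.0
        bounded = math.tanh(value) * half          # squash into (-half, half)
        codes.append(int(round(bounded + half)))   # shift to [0, n_levels - 1] and round
    return codes


def codes_to_token(codes, levels):
    """Pack per-dimension codes into a single index (mixed radix, first dim least significant)."""
    token = 0
    for code, n_levels in zip(reversed(codes), reversed(levels)):
        token = token * n_levels + code
    return token


levels = [5, 5, 5]                       # assumed: 3 dimensions, 5 levels each (125 tokens)
codes = fsq_quantize([10.0, -10.0, 0.0], levels)
print(codes, codes_to_token(codes, levels))  # [4, 0, 2] 54
```

Because the codebook is implicit in the level grid, the vocabulary size is simply the product of the level counts.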

HPRO Framework

Figure 3: HPRO framework. The LLM generates differentiable speech tokens via Gumbel-Softmax, which are mapped by HD-Emo codec into preference spaces. Hierarchical frame-, word-, and sentence-level rewards are progressively applied to update the LLM.
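
Two ingredients of the caption can be sketched in a few lines: the Gumbel-Softmax relaxation that turns token logits into a soft, differentiable distribution, and the pooling of frame-level rewards into word- and sentence-level rewards. This is a minimal sketch under stated assumptions: the function names, the mean pooling, and the word spans given as frame index ranges are all hypothetical, and the way HPRO actually weights or schedules the three levels is not specified on this page.

```python
import math
import random


def gumbel_softmax(logits, temperature=1.0, rng=random):
    """Return a soft one-hot sample over tokens (forward pass only)."""
    noisy = []
    for logit in logits:
        u = max(rng.random(), 1e-12)              # avoid log(0)
        gumbel = -math.log(-math.log(u))          # Gumbel(0, 1) noise
        noisy.append((logit + gumbel) / temperature)
    peak = max(noisy)                             # subtract max for numerical stability
    exps = [math.exp(v - peak) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def pool_rewards(frame_rewards, word_spans):
    """Average frame rewards into word-level and sentence-level rewards.

    `word_spans` holds (start, end) frame indices per word, end exclusive;
    mean pooling is an assumption made for illustration.
    """
    word_rewards = [sum(frame_rewards[s:e]) / (e - s) for s, e in word_spans]
    sentence_reward = sum(frame_rewards) / len(frame_rewards)
    return word_rewards, sentence_reward


probs = gumbel_softmax([0.0, 0.0, 50.0], temperature=0.1, rng=random.Random(0))
words, sentence = pool_rewards([1.0, 0.0, 0.5, 0.5], [(0, 2), (2, 4)])
print(words, sentence)  # [0.5, 0.5] 0.5
```

Lower temperatures push the soft sample toward a one-hot vector, while higher temperatures keep it smoother and easier to differentiate through.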

Comparative Experiment

To demonstrate the effectiveness of the proposed approach, we compare HPRO against several popular TTS models.
The evaluated systems are HPRO, CosyVoice2, CosyVoice3, IndexTTS2, and HD-PPT, with TokenRecon included as a reference.

Evaluation Metrics. For each generated audio sample, we report three objective scores, following the metrics used in our paper: wVAD-CCC, EMO-SIM, and DNSMOS.

* The scores are listed for each generated audio sample in the format: [wVAD-CCC, EMO-SIM, DNSMOS].
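
For readers unfamiliar with the first metric, the sketch below computes the concordance correlation coefficient (CCC) and a weighted average over three dimensions. It assumes that wVAD-CCC denotes a weighted combination of per-dimension CCC over valence, arousal, and dominance; the name suggests this, but this page does not define it. The equal default weights and the function names are hypothetical.

```python
def ccc(pred, target):
    """Concordance correlation coefficient between two equal-length sequences."""
    n = len(pred)
    mean_p = sum(pred) / n
    mean_t = sum(target) / n
    var_p = sum((p - mean_p) ** 2 for p in pred) / n
    var_t = sum((t - mean_t) ** 2 for t in target) / n
    cov = sum((p - mean_p) * (t - mean_t) for p, t in zip(pred, target)) / n
    denom = var_p + var_t + (mean_p - mean_t) ** 2
    if denom == 0:
        return 1.0  # both sequences are constant and identical
    return 2 * cov / denom


def weighted_vad_ccc(pred_vad, target_vad, weights=(1 / 3, 1 / 3, 1 / 3)):
    """Weighted mean of per-dimension CCC (valence, arousal, dominance).

    Assumed form of wVAD-CCC; the weights here are placeholders.
    """
    scores = [ccc(p, t) for p, t in zip(pred_vad, target_vad)]
    return sum(w * s for w, s in zip(weights, scores)) / sum(weights)


print(round(ccc([1.0, 2.0, 3.0], [2.0, 3.0, 4.0]), 3))  # 0.571
```

Unlike Pearson correlation, CCC penalizes a constant offset between prediction and reference, which is why the shifted sequence in the usage line scores well below 1.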

💡 Note on Semantic-Driven Emotion

In the following examples, the Truth and Prompt audio of some samples come from different speakers. This setup shows that HPRO can generate high-intensity emotional prosody guided by textual semantics alone, indicating that the emotional expression stems from linguistic understanding rather than acoustic imitation of the reference prompt.

For a direct comparison, we also provide TokenRecon (as described in our paper), which reconstructs the ground-truth speech tokens using the reference prompt, serving as the theoretical upper bound for semantic-driven emotional expressiveness.

Sample 1

🎭 Emotion: Happy

🗣️ Prompt Text: "Just because you think you're right doesn't mean you are."

📝 Content Text: "The morning dew glittered like tiny diamonds, signaling a day filled with promise."

Prompt Truth
TokenRecon: [0.514, 1.00, 3.98]
HPRO: [0.406, 0.72, 3.97]
CosyVoice2: [0.380, 0.95, 3.86]
CosyVoice3: [0.125, 0.03, 3.80]
IndexTTS2: [0.174, 0.00, 3.95]
HD-PPT: [0.513, 0.20, 3.96]


Sample 2

🎭 Emotion: Happy

🗣️ Prompt Text: "The sight of the empty stage left us baffled. The performer was missing."

📝 Content Text: "Oh, wow. Did you see how cute the puppies are? Just heartwarming."

Prompt Truth
TokenRecon: [0.769, 1.00, 4.13]
HPRO: [0.403, 1.00, 3.95]
CosyVoice2: [0.332, 1.00, 3.74]
CosyVoice3: [0.418, 1.00, 3.57]
IndexTTS2: [0.340, 1.00, 3.57]
HD-PPT: [0.327, 1.00, 4.01]


Sample 3

🎭 Emotion: Angry

🗣️ Prompt Text: "The old swing set creaked, echoing years of forgotten laughter and play."

📝 Content Text: "He slammed a door with a force that echoed his simmering rage."

Prompt Truth
TokenRecon: [0.388, 0.09, 4.13]
HPRO: [0.239, 1.00, 4.03]
CosyVoice2: [0.258, 0.00, 4.38]
CosyVoice3: [0.262, 0.00, 4.20]
IndexTTS2: [0.869, 1.00, 3.57]
HD-PPT: [0.314, 1.00, 3.98]


Sample 4

🎭 Emotion: Angry

🗣️ Prompt Text: "Why are you yelling at me for something that wasn't even my idea? Why? Why?"

📝 Content Text: "How many times do we have to go over this?"

Prompt Truth
TokenRecon: [0.603, 0.35, 3.57]
HPRO: [0.211, 0.94, 3.97]
CosyVoice2: [0.005, 0.52, 3.84]
CosyVoice3: [-0.207, 0.35, 3.35]
IndexTTS2: [0.160, 0.42, 3.39]
HD-PPT: [0.453, 0.35, 4.04]


Sample 5

🎭 Emotion: Surprised

🗣️ Prompt Text: "How can you be so consistently unreliable?"

📝 Content Text: "Wait, the train's arriving on time for once? That's unheard of!"

Prompt Truth
TokenRecon: [0.444, 0.70, 4.10]
HPRO: [0.561, 0.82, 4.02]
CosyVoice2: [0.153, 0.81, 3.93]
CosyVoice3: [0.042, 0.78, 4.20]
IndexTTS2: [0.333, 0.71, 3.42]
HD-PPT: [0.087, 0.74, 4.07]


Sample 6

🎭 Emotion: Surprised

🗣️ Prompt Text: "Our fleet of pastel clouds sailed across the sky, painting expressions of tranquil bliss."

📝 Content Text: "Look, a double rainbow stretching across the entire sky!"

Prompt Truth
TokenRecon: [0.456, 0.03, 3.98]
HPRO: [0.266, 0.98, 3.89]
CosyVoice2: [0.351, 0.02, 3.90]
CosyVoice3: [0.364, 0.36, 3.83]
IndexTTS2: [0.113, 1.00, 3.53]
HD-PPT: [0.444, 0.03, 4.09]


Sample 7

🎭 Emotion: Sad

🗣️ Prompt Text: "Why does it always feel like the world is ending when the sun sets like this?"

📝 Content Text: "She said nothing, but her silence screamed volumes."

Prompt Truth
TokenRecon: [0.927, 1.00, 3.93]
HPRO: [0.326, 0.00, 3.98]
CosyVoice2: [-0.260, 0.00, 3.90]
CosyVoice3: [0.155, 0.11, 4.08]
IndexTTS2: [0.248, 0.00, 3.53]
HD-PPT: [0.422, 0.00, 3.89]


Sample 8

🎭 Emotion: Fearful

🗣️ Prompt Text: "Crashed underfoot, a reminder of what was, but is no longer."

📝 Content Text: "Is that sound supposed to be coming from the attic?"

Prompt Truth
TokenRecon: [0.599, 1.00, 3.85]
HPRO: [0.616, 0.00, 4.07]
CosyVoice2: [0.284, 0.99, 3.70]
CosyVoice3: [0.509, 0.99, 3.71]
IndexTTS2: [0.335, 0.28, 3.71]
HD-PPT: [0.247, 1.00, 3.85]


Sample 9

🎭 Emotion: Neutral

🗣️ Prompt Text: "Um at that happened quick, he didn't know with time who knows."

📝 Content Text: "Because how could you say that to, you know, a child you know, but she didn't know any better, you know what I mean, she"

Prompt Truth
TokenRecon: [0.669, 1.00, 3.84]
HPRO: [0.452, 1.00, 3.66]
CosyVoice2: [0.461, 0.01, 3.61]
CosyVoice3: [0.437, 0.00, 4.06]
IndexTTS2: [0.043, 0.03, 3.53]
HD-PPT: [0.506, 1.00, 3.95]