eess.AS (Empirical)

Unsupervised Approaches for Global Prosodic Embedding Extraction

Martin Meza, Luciana Ferrer, Pablo Riera

Jun 12, 2026

AI Summary (llama-3.1-8b-instruct)

The paper proposes several approaches for producing global prosodic embeddings based on auto-encoder models of pitch and energy. It also introduces a benchmark on which these embeddings perform competitively with, or better than, various alternatives under challenging conditions.

Keywords

prosody, speech, auto-encoder, pitch, energy

Before reading this…

Understanding of speech processing and auto-encoder models

Applications

→ Analyzing the role of prosody in a given task
→ Providing input to speech synthesis systems

Abstract

Prosody is central to oral communication, conveying information such as the emotional state of the speaker and cues needed for meaning disambiguation. Many self-supervised models of speech produce embeddings that encode prosodic as well as linguistic and speaker information. This entanglement of information is problematic in scenarios where prosody is the main distinguishing factor while other factors may vary between training and deployment; in such cases, a purely prosodic representation would be more robust. Such a representation could also be used for analyzing the role of prosody in a given task or as input to speech synthesis systems. In this work, we propose a variety of approaches for producing global prosodic embeddings based on auto-encoder models of pitch and energy. We develop a benchmark for assessing the performance of these representations, showing that our embeddings provide competitive or superior performance under challenging conditions, compared to various alternatives.
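
The abstract does not describe the model architectures, so the following is only a minimal sketch of the general idea: compress an utterance's pitch and energy contours through an auto-encoder bottleneck and use the bottleneck as a single, utterance-level ("global") embedding. Everything here is an assumption made for illustration and not the authors' method: the tied-weight linear auto-encoder, the fixed-length resampling, the per-utterance z-normalization, the synthetic data, and all names (`resample`, `make_input`, `LinearAutoEncoder`, `toy_utterance`, `demo`). It also assumes the pitch contour has already been interpolated over unvoiced frames.

```python
import math
import random


def resample(contour, n_points):
    """Linearly interpolate a variable-length contour onto n_points samples."""
    if len(contour) == 1:
        return [float(contour[0])] * n_points
    out = []
    for i in range(n_points):
        pos = i * (len(contour) - 1) / (n_points - 1)
        lo = int(pos)
        hi = min(lo + 1, len(contour) - 1)
        frac = pos - lo
        out.append((1 - frac) * contour[lo] + frac * contour[hi])
    return out


def znorm(values):
    """Zero-mean, unit-variance normalization within one utterance."""
    mean = sum(values) / len(values)
    std = math.sqrt(sum((v - mean) ** 2 for v in values) / len(values)) or 1.0
    return [(v - mean) / std for v in values]


def make_input(pitch, energy, n_points=16):
    """Concatenate the normalized, fixed-length pitch and energy contours."""
    return znorm(resample(pitch, n_points)) + znorm(resample(energy, n_points))


class LinearAutoEncoder:
    """Tied-weight linear auto-encoder (assumption): z = W x, x_hat = W^T z."""

    def __init__(self, in_dim, emb_dim, seed=0):
        rng = random.Random(seed)
        self.W = [[rng.gauss(0, 0.1) for _ in range(in_dim)] for _ in range(emb_dim)]

    def encode(self, x):
        return [sum(w * xi for w, xi in zip(row, x)) for row in self.W]

    def decode(self, z):
        return [sum(self.W[k][j] * z[k] for k in range(len(z)))
                for j in range(len(self.W[0]))]

    def loss(self, x):
        x_hat = self.decode(self.encode(x))
        return sum((a - b) ** 2 for a, b in zip(x_hat, x)) / len(x)

    def train_step(self, x, lr=0.005):
        """One SGD step on 0.5 * ||W^T W x - x||^2."""
        z = self.encode(x)
        err = [a - b for a, b in zip(self.decode(z), x)]
        w_err = [sum(w * e for w, e in zip(row, err)) for row in self.W]
        for k, row in enumerate(self.W):
            for j in range(len(row)):
                # Decoder-path term plus encoder-path term of the gradient.
                row[j] -= lr * (z[k] * err[j] + w_err[k] * x[j])


def toy_utterance(rng):
    """Synthetic pitch (Hz) and energy (dB) contours; purely illustrative."""
    n = rng.randint(40, 120)
    slope, freq = rng.uniform(-1, 1), rng.uniform(1, 3)
    pitch = [120 + 30 * slope * t / n + 10 * math.sin(2 * math.pi * freq * t / n)
             for t in range(n)]
    energy = [-30 + 10 * math.sin(math.pi * t / n) + rng.gauss(0, 0.5)
              for t in range(n)]
    return pitch, energy


def demo(seed=0, n_utts=100, epochs=20, emb_dim=4):
    """Train on toy data; return mean loss before/after and one embedding."""
    rng = random.Random(seed)
    data = [make_input(*toy_utterance(rng)) for _ in range(n_utts)]
    model = LinearAutoEncoder(in_dim=len(data[0]), emb_dim=emb_dim, seed=seed)
    before = sum(model.loss(x) for x in data) / len(data)
    for _ in range(epochs):
        rng.shuffle(data)
        for x in data:
            model.train_step(x)
    after = sum(model.loss(x) for x in data) / len(data)
    return before, after, model.encode(data[0])
```

Calling `demo()` returns the mean reconstruction loss before and after training, plus the embedding of one utterance. Because the encoder sees only pitch and energy, the embedding cannot carry lexical content directly, which is the property the abstract motivates. A real system would presumably use a non-linear sequence model and real pitch and energy trackers in place of these stand-ins.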