Unsupervised Approaches for Global Prosodic Embedding Extraction
The paper proposes methods for generating global prosodic embeddings using auto-encoder models of pitch and energy, demonstrating competitive or superior performance under challenging conditions.
Before reading this…
Applications
- Analyzing the role of prosody in tasks
- Speech synthesis systems
To understand this paper, make sure you know these concepts first:
- Understanding of speech processing and auto-encoder models
Abstract
Prosody is central to oral communication, conveying information like the emotional state of the speaker and cues needed for meaning disambiguation. Many self-supervised models of speech produce embeddings that encode prosodic as well as linguistic and speaker information. This entanglement of information is problematic in scenarios where prosody is the main distinguishing factor while other factors may vary between training and deployment; in such cases, a purely prosodic representation would be more robust. Such a representation could also be used for analyzing the role of prosody in a given task or as input to speech synthesis systems. In this work, we propose a variety of approaches for producing global prosodic embeddings based on auto-encoder models of pitch and energy. We develop a benchmark for assessing the performance of these representations, showing that our embeddings provide competitive or superior performance under challenging conditions, compared to various alternatives.
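
To make the idea of a "global prosodic embedding from an auto-encoder of pitch and energy" concrete, here is a minimal sketch in Python. It is not the paper's architecture, which the abstract does not describe. It assumes the simplest possible setup: each utterance's pitch and energy contours are resampled to a fixed length, z-normalised per utterance, and passed through a linear auto-encoder whose bottleneck serves as one fixed-size vector per utterance. The names `resample_contour`, `make_features`, and `LinearAutoEncoder`, the embedding size of 8, and the synthetic contours are all illustrative assumptions. Real pitch tracks also contain unvoiced frames, which this sketch ignores.

```python
import numpy as np


def resample_contour(x, n_points=32):
    """Linearly interpolate a variable-length contour to n_points samples."""
    x = np.asarray(x, dtype=float)
    src = np.linspace(0.0, 1.0, num=len(x))
    dst = np.linspace(0.0, 1.0, num=n_points)
    return np.interp(dst, src, x)


def make_features(pitch, energy, n_points=32):
    """Concatenate length-normalised, per-utterance z-normalised contours."""
    feats = []
    for contour in (pitch, energy):
        c = resample_contour(contour, n_points)
        c = (c - c.mean()) / (c.std() + 1e-8)  # removes speaker-level offset and scale
        feats.append(c)
    return np.concatenate(feats)


class LinearAutoEncoder:
    """Hypothetical linear auto-encoder; the bottleneck is the global embedding."""

    def __init__(self, in_dim, emb_dim, seed=0):
        rng = np.random.default_rng(seed)
        self.W_enc = rng.normal(scale=0.1, size=(in_dim, emb_dim))
        self.W_dec = rng.normal(scale=0.1, size=(emb_dim, in_dim))

    def encode(self, X):
        return X @ self.W_enc

    def reconstruct(self, X):
        return self.encode(X) @ self.W_dec

    def loss(self, X):
        # Mean squared reconstruction error over all elements.
        return float(np.mean((self.reconstruct(X) - X) ** 2))

    def fit(self, X, lr=0.1, epochs=500):
        # Plain full-batch gradient descent on the reconstruction loss.
        for _ in range(epochs):
            Z = X @ self.W_enc
            err = Z @ self.W_dec - X
            scale = 2.0 / err.size
            grad_dec = scale * (Z.T @ err)
            grad_enc = scale * (X.T @ (err @ self.W_dec.T))
            self.W_dec -= lr * grad_dec
            self.W_enc -= lr * grad_enc
        return self


# Synthetic stand-ins for frame-level pitch (Hz) and energy (dB) tracks.
rng = np.random.default_rng(0)
utterances = []
for _ in range(40):
    n_frames = int(rng.integers(50, 200))
    t = np.linspace(0.0, 1.0, n_frames)
    pitch = 120.0 + 30.0 * rng.normal() * t + rng.normal(scale=2.0, size=n_frames)
    energy = 60.0 + 10.0 * np.sin(2 * np.pi * t * rng.uniform(0.5, 2.0))
    energy = energy + rng.normal(scale=1.0, size=n_frames)
    utterances.append((pitch, energy))

X = np.stack([make_features(p, e) for p, e in utterances])
model = LinearAutoEncoder(in_dim=X.shape[1], emb_dim=8)
loss_before = model.loss(X)
model.fit(X)
loss_after = model.loss(X)
embeddings = model.encode(X)  # one fixed-size vector per utterance
```

Because the input contains only pitch and energy, the resulting vector cannot encode lexical content directly, which is the property the abstract argues for. A practical system would likely replace the linear layers with a non-linear sequence model and handle unvoiced frames explicitly.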