13 results for “Fundamentals of speaker diarization, speaker verification, and speaker identification”
CS papers only. Hybrid search: keyword + semantic, ranked by combined score.
Minjae Lee, Hee-Soo Heo, Youngki Kwon, Han-Gyu Kim +2 more
The paper introduces Target Speaker Tagging (TST), a task that combines speaker diarization, verification, and identification into a single workflow for multi-speaker conversations. It presents TST-Be…
Echo is a joint-embedding predictive architecture that uses a single, pretrained ViT encoder to simultaneously perform speaker diarization, speech recognition, and dynamic source separation in a share…
A Wave-U-Net model is trained to extract a fundamental waveform from input speech signals for accurate and robust instantaneous pitch estimation.
The paper proposes a novel workflow to extract fine-grained regional accent features in Brazilian Portuguese using only acoustic labels and a phoneme-based forced aligner, showing that localized featu…
Pengcheng Zhou, Pianran Guo, Shuhua Chen, Mengqin Zhao +2 more
The paper proposes Domain-Aware Sharpness Minimization (DASM), a novel optimizer that enhances the robustness and generalization of voice stream steganalysis models across varying data distributions.
The paper introduces BEA-Dialogue+, an expanded 200-hour corpus for Hungarian conversational ASR, demonstrating that while larger data is challenging, specialized fine-tuning techniques significantly…
Yifan Liao, Zongmin Zhang, Zhen Sun, Yuhui Sun +2 more
The paper introduces a novel Clean-Referenced Feature-Vocoder Attack, a black-box adversarial attack that perturbs high-level SSL feature representations instead of raw audio waveforms, achieving supe…
MelShield is a robust, in-generation audio watermarking framework that embeds identifiable signals into AI-generated speech in the Mel-spectrogram domain for reliable copyright protection and attribut…
Bing Liu, Shunping Wang, Yufan Zhu, Xinyi Yu +4 more
This paper introduces 'implicit identity' as a unifying framework to survey and categorize LLM fingerprinting and watermarking techniques for verifying ownership and provenance across datasets, models…
The paper introduces GRIDS, a framework using Local Intrinsic Dimensionality (LID) to detect anomalies in self-supervised speech model representations, showing that LID elevation correlates with ASR d…
Haechan Kim, Seungjun Chung, Inkyu Park, Jihoo Lee +1 more
The paper introduces three new Korean speech benchmarks (KVoiceBench, KOpenAudioBench, and KMMAU) to evaluate SpeechLMs, demonstrating that English-centric evaluation fails to capture performance gaps…
Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu +3 more
PolySpeech-100 introduces a massive multilingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages…