11 results for “Understanding of speech processing and auto-encoder models”
CS papers onlyHybrid search: Keyword + semantic, ranked by combined score.ⓘ
Want pure semantic search? Try claim verification →
Yifan Liao, Zongmin Zhang, Zhen Sun, Yuhui Sun +2 more
The paper introduces a novel Clean-Referenced Feature-Vocoder Attack, a black-box adversarial attack that perturbs high-level SSL feature representations instead of raw audio waveforms, achieving supe…
Jing Peng, Junhao Du, Chenghao Wang, Hanqi Li +20 more
The paper introduces SURE, a unified framework designed to standardize and improve the comparability and reproducibility of evaluations for advanced speech understanding models.
Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu +3 more
PolySpeech-100 introduces a massive, multi-lingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages…
The paper introduces GRIDS, a framework using Local Intrinsic Dimensionality (LID) to detect anomalies in self-supervised speech model representations, showing that LID elevation correlates with ASR d…
The paper proposes methods for generating global prosodic embeddings using auto-encoder models of pitch and energy, demonstrating competitive or superior performance under challenging conditions.
SALSA is a lightweight adaptation method that learns layer-wise steering vectors to significantly improve the performance of speech-aware LLMs on out-of-domain speech tasks.
Zhisheng Zhang, Xiang Li, Yixuan Zhou, Jing Peng +2 more
LoSATok proposes a low-dimensional semantic-acoustic tokenizer that efficiently compresses high-dimensional audio features into a compact latent space, significantly improving the performance and effi…
The paper proposes a Lagrangian sub-flow (LSF) framework and geometric diagnostic signals to improve out-of-distribution detection using Continuous Normalizing Flows, overcoming the likelihood paradox…
Chen Yang, Chufan Yu, Hanfu Chen, Jie Zhu +21 more
MOSS-Audio is a unified audio-language model designed for comprehensive understanding of speech, environmental sounds, and music, achieving strong performance across various audio-grounded tasks.
Murmur is an efficient inference system for long-form ASR that resolves the accuracy-latency trade-off by optimizing both inter-chunk processing and intra-chunk attention mechanisms.
Bohan Li, Shi Lian, Hankun Wang, Yiwei Guo +5 more
HoliTok introduces a novel continuous holistic tokenization model that provides a unified, high-fidelity latent representation for simultaneously supporting both speech generation and speech understan…