ArXivCSExplorer
☆☆Bookmarks🏆RSSHow to UseFAQ
Built with and by Teycir Ben Soltane•
How to Use•FAQ•GitHub•arXiv.org•
Share:

11 results for “Understanding of speech processing and auto-encoder models”

CS papers only

Hybrid search: Keyword + semantic, ranked by combined score.ⓘ

Want pure semantic search? Try claim verification →

cs.SDcs.AIcs.CRRecentJun 4, 2026

Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition

Yifan Liao, Zongmin Zhang, Zhen Sun, Yuhui Sun +2 more

The paper introduces a novel Clean-Referenced Feature-Vocoder Attack, a black-box adversarial attack that perturbs high-level SSL feature representations instead of raw audio waveforms, achieving supe…

View →
eess.AScs.AIcs.SDRecentMay 29, 2026

A Unified and Reproducible Experimentation Framework for Speech Understanding

Jing Peng, Junhao Du, Chenghao Wang, Hanqi Li +20 more

The paper introduces SURE, a unified framework designed to standardize and improve the comparability and reproducibility of evaluations for advanced speech understanding models.

View →
cs.CLcs.AIeess.ASRecentMay 31, 2026

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu +3 more

PolySpeech-100 introduces a massive, multi-lingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages…

View →
eess.AScs.CRcs.LGRecentMay 4, 2026

Dimensionality-Aware Anomaly Detection in Learned Representations of Self-Supervised Speech Models

Sandra Arcos-Holzinger, Sarah M. Erfani, James Bailey, Sanjeev Khudanpur

The paper introduces GRIDS, a framework using Local Intrinsic Dimensionality (LID) to detect anomalies in self-supervised speech model representations, showing that LID elevation correlates with ASR d…

View →
eess.ASEmpiricalRecentJun 12, 2026

Unsupervised Approaches for Global Prosodic Embedding Extraction

Martin Meza, Luciana Ferrer, Pablo Riera

The paper proposes methods for generating global prosodic embeddings using auto-encoder models of pitch and energy, demonstrating competitive or superior performance under challenging conditions.

View →
cs.CLeess.ASRecentMay 30, 2026

SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors

Yekaterina Yegorova, Argyrios Gerogiannis, Haolong Zheng, Julia Hockenmaier +2 more

SALSA is a lightweight adaptation method that learns layer-wise steering vectors to significantly improve the performance of speech-aware LLMs on out-of-domain speech tasks.

View →
eess.AScs.AIcs.SDRecentMay 27, 2026

LoSATok: Low-dimensional Semantic-Acoustic Tokenizer for Cross-Domain Audio Understanding and Generation

Zhisheng Zhang, Xiang Li, Yixuan Zhou, Jing Peng +2 more

LoSATok proposes a low-dimensional semantic-acoustic tokenizer that efficiently compresses high-dimensional audio features into a compact latent space, significantly improving the performance and effi…

View →
eess.AScs.CLcs.SDRecentMay 30, 2026

Local Diagnostics of Continuous Normalizing Flow for Out-of-Distribution Detection

Xinwei Cao, Mengxuan Lu, Torbjørn Svendsen, Giampiero Salvi

The paper proposes a Lagrangian sub-flow (LSF) framework and geometric diagnostic signals to improve out-of-distribution detection using Continuous Normalizing Flows, overcoming the likelihood paradox…

View →
cs.SDcs.AIRecentJun 1, 2026

MOSS-Audio Technical Report

Chen Yang, Chufan Yu, Hanfu Chen, Jie Zhu +21 more

MOSS-Audio is a unified audio-language model designed for comprehensive understanding of speech, environmental sounds, and music, achieving strong performance across various audio-grounded tasks.

View →
cs.LGcs.AIeess.ASRecentMay 31, 2026

MURMUR: An Efficient Inference System for Long-Form ASR

Wei-Tzu Lee, Keisuke Kamahori, Baris Kasikci

Murmur is an efficient inference system for long-form ASR that resolves the accuracy-latency trade-off by optimizing both inter-chunk processing and intra-chunk attention mechanisms.

View →
cs.SDcs.AIeess.ASRecentMay 28, 2026

HoliTok:A Coutinuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

Bohan Li, Shi Lian, Hankun Wang, Yiwei Guo +5 more

HoliTok introduces a novel continuous holistic tokenization model that provides a unified, high-fidelity latent representation for simultaneously supporting both speech generation and speech understan…

View →