Speech & Audio

Automatic speech recognition (ASR), text-to-speech (TTS), audio generation

20 papers indexed

cs.SD · cs.AI · Empirical · Recent · Jul 16, 2026

RW-Voice-EQ Bench: A Real World Benchmark for Evaluating Voice AI Systems

David Ayllon, Alice Baird, Jeffrey Brooks, Franc Camps-Febrer +10 more

The paper introduces the Real World Voice EQ Bench, a multidimensional benchmark for evaluating voice AI across text-to-speech, speech-to-speech, speech understanding, and automatic speech recognition…

eess.AS · cs.AI · cs.CL · Recent · May 29, 2026

ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment

Jun-Hak Yun, Seung-Bin Kim, Seong-Whan Lee

ImmersiveTTS is an environment-aware text-to-speech model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross-modal interactions, achieving s…

eess.AS · Empirical · Recent · Jun 18, 2026

Transcript-Free Flow-Matching Text-to-Speech via Speech Feature Conditioning

SooHwan Eom, Hee Suk Yoon, Eunseop Yoon, Mark Hasegawa-Johnson +1 more

The paper proposes RTFree-F5, a method to make flow-matching TTS models like F5-TTS independent of reference transcripts, improving performance and naturalness for dysarthric speakers.

cs.SD · cs.AI · Recent · Jun 1, 2026

MOSS-Audio Technical Report

Chen Yang, Chufan Yu, Hanfu Chen, Jie Zhu +21 more

MOSS-Audio is a unified audio-language model designed for comprehensive understanding of speech, environmental sounds, and music, achieving strong performance across various audio-grounded tasks.

cs.SD · cs.CR · Recent · May 2, 2026

MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech

Yutong Jin, Qi Li, Lingshuang Liu, Jianbing Ni

MelShield is a robust, in-generation audio watermarking framework that embeds identifiable signals into AI-generated speech in the Mel-spectrogram domain for reliable copyright protection and attribut…

eess.AS · Empirical · Recent · Jun 19, 2026

Vaani Benchmark V1.0: An Inclusive Multimodal Benchmark Dataset for Hindi

Sujith Pulikodan, Agneedh Basu, Saurabh Kumar, Pranav Bhat +4 more

The paper introduces a new inclusive, multimodal Hindi ASR benchmark with real-world recordings and diverse demographic groups, enabling more robust and realistic evaluation.

cs.SD · cs.AI · cs.CR · Recent · Jun 4, 2026

Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition

Yifan Liao, Zongmin Zhang, Zhen Sun, Yuhui Sun +2 more

The paper introduces a novel Clean-Referenced Feature-Vocoder Attack, a black-box adversarial attack that perturbs high-level SSL feature representations instead of raw audio waveforms, achieving supe…

cs.CL · cs.AI · Recent · May 27, 2026

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

Haechan Kim, Seungjun Chung, Inkyu Park, Jihoo Lee +1 more

The paper introduces three new Korean speech benchmarks (KVoiceBench, KOpenAudioBench, and KMMAU) to evaluate SpeechLMs, demonstrating that English-centric evaluation fails to capture performance gaps…

cs.CL · cs.SD · Empirical · Recent · Jul 2, 2026

Reinforcement Learning for Data-Efficient Code-Switched ASR

Ziwei Ye, Peter Vickers

This paper proposes a reinforcement learning approach for adapting audio-language models to code-switched speech using group relative policy optimization and verifiable rewards.

cs.SD · cs.AI · cs.CR · Empirical · Recent · Jul 18, 2026

Do Speech Tokens Leak Voiceprints? Speaker Inversion Attacks Against End-to-End Speech Language Models

Ye Lu, Yihan Yan, Zhaoyang Zhang, Zhitao Ou +3 more

This paper introduces Audio BERT (AuB) and SpInv, methods for recovering embeddings from speech tokens and performing speaker inversion attacks using only three seconds of frontend output.

eess.AS · cs.AI · cs.SD · Recent · May 27, 2026

LoSATok: Low-dimensional Semantic-Acoustic Tokenizer for Cross-Domain Audio Understanding and Generation

Zhisheng Zhang, Xiang Li, Yixuan Zhou, Jing Peng +2 more

LoSATok proposes a low-dimensional semantic-acoustic tokenizer that efficiently compresses high-dimensional audio features into a compact latent space, significantly improving the performance and effi…

eess.AS · Empirical · Recent · Jul 20, 2026

X-Translator: A Real-Time Multilingual Speaker-Aware Speech-to-Speech Translation System

Yuxiang Zhao, Yichi Zhang, Yanjie An, Yanqiao Zhu +9 more

X-Translator is a low-cost modular system for real-time speech-to-speech translation, using streaming ASR, machine translation, and prompt-conditioned TTS, with session-level control.

cs.AI · Recent · May 27, 2026

Data-Efficient On-Policy Distillation for Automatic Speech Recognition

Yu Lin, Yiming Wang, Runyuan Cai, Xiaodong Zeng

The paper demonstrates that using on-policy distillation from a strong teacher model significantly improves the performance of compact Automatic Speech Recognition (ASR) models, achieving competitive…

cs.SD · cs.AI · eess.AS · Recent · May 29, 2026

Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS

Deokjin Seo, Gangin Park, Kihyun Nam

Chatterbox-Flash introduces a prior-calibrated block diffusion model for zero-shot TTS that achieves high-fidelity, streaming synthesis with significantly lower computational overhead than existing me…

cs.SD · cs.CL · eess.AS · Empirical · Recent · Jul 19, 2026

Staged Depth-Pruning Distillation of a Flow-Matching Text-to-Speech Teacher: A Compact Hindi Speech Synthesizer

Sivateja Trikutam

This paper presents a method for building a compact Hindi text-to-speech model by pruning a large teacher model under a severe data budget, achieving state-of-the-art performance.

cs.SD · cs.AI · Recent · May 30, 2026

Beyond the Mouth: Upper-Face Affective Cues in Audiovisual Sentence Recognition under Acoustic Uncertainty

Zhou Yang, Yueyi Yang

This paper investigates whether upper-face affective cues enhance audiovisual sentence recognition, especially when audio is degraded, finding that while mouth cues are crucial for robustness, upper-face c…

eess.AS · Empirical · Recent · Jun 16, 2026

An Analysis of the Effectiveness of Synthetic Speech Data for ASR Fine-tuning in Selected Indic Languages

Sujith Pulikodan, Agneedh Basu, Pavan Kumar, Pranav Bhat +3 more

This paper investigates the effectiveness of incorporating synthetic speech data in automatic speech recognition (ASR) systems for three Indic languages by analyzing performance gains, script sources,…

eess.AS · Empirical · Recent · Jul 2, 2026

Enhancing Acoustic-to-Articulatory Inversion with Multi-Target Pretraining for Low-Resource Settings

Jesuraj Bandekar, Prasanta Kumar Ghosh

This paper proposes a novel pretraining method for acoustic-to-articulatory inversion (AAI) using phoneme labels, articulatory feature labels, and critical-articulator labels, improving performance an…

cs.SD · cs.AI · Empirical · Recent · Jul 21, 2026

What the Waveform Knows: Transparent-first Speech and Audio Intelligence with Caption Studio

Cheng Siong Chin, Jianhua Zhang, Mohan Venkateshkumar

Caption Studio is a transparency-first speech and audio intelligence platform that provides automated transcription, speaker diarization, speech analytics, signal-level audio analysis, and subtitle ge…

cs.CL · Recent · May 30, 2026

LaSR: Context-Aware Speech Recognition via Latent Reasoning

Heyang Liu, Ziyang Cheng, Jiayi Huang, Wenyang Xiao +4 more

The paper proposes LaSR, a context-aware training paradigm that uses latent reasoning to significantly improve speech recognition, especially for specialized terminology, without adding latency.
