Speech & Audio
ASR, TTS, audio generation, speech recognition
20 papers indexed
ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment
ImmersiveTTS is an environment-aware text-to-speech model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross-modal interactions, achieving s…
MOSS-Audio Technical Report
Chen Yang, Chufan Yu, Hanfu Chen, Jie Zhu +21 more
MOSS-Audio is a unified audio-language model designed for comprehensive understanding of speech, environmental sounds, and music, achieving strong performance across various audio-grounded tasks.
MelShield: Robust Mel-Domain Audio Watermarking for Provenance Attribution of AI Generated Synthesized Speech
MelShield is a robust, in-generation audio watermarking framework that embeds identifiable signals into AI-generated speech in the Mel-spectrogram domain for reliable copyright protection and attribut…
Beyond Waveform Robustness: Robust Feature-Vocoder Adversarial Attacks on Automatic Speech Recognition
Yifan Liao, Zongmin Zhang, Zhen Sun, Yuhui Sun +2 more
The paper introduces a novel Clean-Referenced Feature-Vocoder Attack, a black-box adversarial attack that perturbs high-level SSL feature representations instead of raw audio waveforms, achieving supe…
LoSATok: Low-dimensional Semantic-Acoustic Tokenizer for Cross-Domain Audio Understanding and Generation
Zhisheng Zhang, Xiang Li, Yixuan Zhou, Jing Peng +2 more
LoSATok proposes a low-dimensional semantic-acoustic tokenizer that efficiently compresses high-dimensional audio features into a compact latent space, significantly improving the performance and effi…
KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs
Haechan Kim, Seungjun Chung, Inkyu Park, Jihoo Lee +1 more
The paper introduces three new Korean speech benchmarks (KVoiceBench, KOpenAudioBench, and KMMAU) to evaluate SpeechLMs, demonstrating that English-centric evaluation fails to capture performance gaps…
Beyond the Mouth: Upper-Face Affective Cues in Audiovisual Sentence Recognition under Acoustic Uncertainty
This paper investigates whether upper-face affective cues enhance audiovisual sentence recognition, especially when audio is degraded, finding that while mouth cues are crucial for robustness, upper-face c…
Chatterbox-Flash: Prior-Calibrated Block Diffusion for Streaming Zero-Shot TTS
Chatterbox-Flash introduces a prior-calibrated block diffusion model for zero-shot TTS that achieves high-fidelity, streaming synthesis with significantly lower computational overhead than existing me…
Data-Efficient On-Policy Distillation for Automatic Speech Recognition
The paper demonstrates that using on-policy distillation from a strong teacher model significantly improves the performance of compact Automatic Speech Recognition (ASR) models, achieving competitive…
LaSR: Context-Aware Speech Recognition via Latent Reasoning
Heyang Liu, Ziyang Cheng, Jiayi Huang, Wenyang Xiao +4 more
The paper proposes LaSR, a context-aware training paradigm that uses latent reasoning to significantly improve speech recognition, especially for specialized terminology, without adding latency.
Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts
Yuyue Wang, Xihua Wang, Xin Cheng, Yijing Chen +1 more
The paper introduces PlanAudio, a unified LLM-based framework that directly synthesizes natural, composite audio containing speech and sounds from unconstrained free-form text prompts, outperforming e…
PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects
Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu +3 more
PolySpeech-100 introduces a large multilingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages…
MURMUR: An Efficient Inference System for Long-Form ASR
MURMUR is an efficient inference system for long-form ASR that resolves the accuracy-latency trade-off by optimizing both inter-chunk processing and intra-chunk attention mechanisms.
Sparse Autoencoders for Interpretable Emotion Control in Text-to-Speech
Hongfei Du, Jiacheng Shi, Sidi Lu, Gang Zhou +1 more
The paper uses sparse autoencoders to identify specific latent features within LLM-based TTS models, enabling interpretable and fine-grained control over emotional expression by intervening in small s…
Multimodal Music Recommendation System using LLMs
The paper proposes a novel multimodal framework for session-based music recommendation that jointly models audio, lyric, and semantic content signals within a unified LLM-based sequential reasoning sy…
UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception
Yuhan Song, Linhao Zhang, Aiwei Liu, Chuhan Wu +5 more
UniAudio-Token is a framework that enhances existing semantic speech tokenizers with general audio perception, allowing them to handle diverse audio types while maintaining high-fidelity speech capabi…
Escaping the Linearity Trap: Manifold Detours for Black-Box Adversarial Attacks on Singing Audio Deepfake Detection
Yifan Liao, Yule Liu, Zhen Sun, Zongmin Zhang +4 more
The paper introduces MARS, a novel meta-adversarial framework that significantly improves black-box adversarial attacks against state-of-the-art Singing Voice Deepfake Detection (SVDD) systems by esca…
Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation
This paper provides a unified taxonomy and controlled empirical evaluation of jailbreak attacks and defenses for Large Audio-Language Models (LALMs), demonstrating that safety evaluation must consider…
SALSA: Speech Aware LLM Adaptation via Learned Steering Activation Vectors
SALSA is a lightweight adaptation method that learns layer-wise steering vectors to significantly improve the performance of speech-aware LLMs on out-of-domain speech tasks.
HAIM: Human-AI Music Datasets for AI Music Production Tracking Benchmark
The paper introduces HAIM, a new benchmark dataset designed to move AI music detection beyond simple binary classification by tracking specific stages and types of AI integration in music production.