Papers similar to 2606.01802

~ similar to 2606.01802· 7 results

cs.CLcs.AIeess.ASRecentMay 31, 2026

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu +3 more

PolySpeech-100 introduces a massive, multi-lingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages…

View →

eess.AScs.AIcs.SDRecentMay 27, 2026

LoSATok: Low-dimensional Semantic-Acoustic Tokenizer for Cross-Domain Audio Understanding and Generation

Zhisheng Zhang, Xiang Li, Yixuan Zhou, Jing Peng +2 more

LoSATok proposes a low-dimensional semantic-acoustic tokenizer that efficiently compresses high-dimensional audio features into a compact latent space, significantly improving the performance and effi…

View →

cs.SDcs.AIcs.MMRecentMay 27, 2026

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

Yuyue Wang, Xihua Wang, Xin Cheng, Yijing Chen +1 more

The paper introduces PlanAudio, a unified LLM-based framework that directly synthesizes natural, composite audio containing speech and sounds from unconstrained free-form text prompts, outperforming e…

View →

cs.CLcs.AIcs.SDRecentMay 28, 2026

MusTBENCH: Benchmarking and Advancing Temporal Grounding in Music LLMs

Daeyong Kwon, Qiyu Wu, Shinobu Kuriya, Junghyun Koo +5 more

The paper introduces MusTBENCH, a new benchmark, and MusT, an optimization recipe, to rigorously test and improve the ability of Large Audio-Language Models (LALMs) to accurately ground their musical…

View →

cs.CLcs.AIRecentMay 27, 2026

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

Haechan Kim, Seungjun Chung, Inkyu Park, Jihoo Lee +1 more

The paper introduces three new Korean speech benchmarks (KVoiceBench, KOpenAudioBench, and KMMAU) to evaluate SpeechLMs, demonstrating that English-centric evaluation fails to capture performance gaps…

View →

cs.CLcs.SDRecentMay 29, 2026

UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

Yuhan Song, Linhao Zhang, Aiwei Liu, Chuhan Wu +5 more

UniAudio-Token is a framework that enhances existing semantic speech tokenizers with general audio perception, allowing them to handle diverse audio types while maintaining high-fidelity speech capabi…

View →

cs.SDcs.AIcs.CLRecentMay 28, 2026

COMET: Concept Space Dissection of the Modality Gap in Audio-Text Multimodal Contrastive Embeddings

Yonggang Zhu, Liting Gao, Aidong Men, Wenwu Wang

The paper introduces COMET, a novel PLS-SVD framework, to analyze the audio-text modality gap in CLAP models, showing that shared concepts are captured by a small subset of axes, and proposes a spectr…

View →