~ similar to 2606.01802· 7 results
Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu +3 more
PolySpeech-100 introduces a massive, multi-lingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages…
Zhisheng Zhang, Xiang Li, Yixuan Zhou, Jing Peng +2 more
LoSATok proposes a low-dimensional semantic-acoustic tokenizer that efficiently compresses high-dimensional audio features into a compact latent space, significantly improving the performance and effi…
Yuyue Wang, Xihua Wang, Xin Cheng, Yijing Chen +1 more
The paper introduces PlanAudio, a unified LLM-based framework that directly synthesizes natural, composite audio containing speech and sounds from unconstrained free-form text prompts, outperforming e…
Daeyong Kwon, Qiyu Wu, Shinobu Kuriya, Junghyun Koo +5 more
The paper introduces MusTBENCH, a new benchmark, and MusT, an optimization recipe, to rigorously test and improve the ability of Large Audio-Language Models (LALMs) to accurately ground their musical…
Haechan Kim, Seungjun Chung, Inkyu Park, Jihoo Lee +1 more
The paper introduces three new Korean speech benchmarks (KVoiceBench, KOpenAudioBench, and KMMAU) to evaluate SpeechLMs, demonstrating that English-centric evaluation fails to capture performance gaps…
Yuhan Song, Linhao Zhang, Aiwei Liu, Chuhan Wu +5 more
UniAudio-Token is a framework that enhances existing semantic speech tokenizers with general audio perception, allowing them to handle diverse audio types while maintaining high-fidelity speech capabi…
The paper introduces COMET, a novel PLS-SVD framework, to analyze the audio-text modality gap in CLAP models, showing that shared concepts are captured by a small subset of axes, and proposes a spectr…