Similar to 2605.29430 · 11 results
The paper introduces Script-Normalized WER (SN-WER), a novel evaluation metric that transliterates ASR transcripts into a canonical script to accurately measure speech recognition performance across d…
Jing Peng, Junhao Du, Chenghao Wang, Hanqi Li +20 more
The paper introduces SURE, a unified framework designed to standardize and improve the comparability and reproducibility of evaluations for advanced speech understanding models.
Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu +3 more
PolySpeech-100 introduces a massive multilingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages…
Haechan Kim, Seungjun Chung, Inkyu Park, Jihoo Lee +1 more
The paper introduces three new Korean speech benchmarks (KVoiceBench, KOpenAudioBench, and KMMAU) to evaluate SpeechLMs, demonstrating that English-centric evaluation fails to capture performance gaps…
Heyang Liu, Ziyang Cheng, Jiayi Huang, Wenyang Xiao +4 more
The paper proposes LaSR, a context-aware training paradigm that uses latent reasoning to significantly improve speech recognition, especially for specialized terminology, without adding latency.
This study benchmarks token-optimized formats (TOON and TRON) against JSON in end-to-end agentic AI systems, finding that TRON significantly reduces token overhead with minimal performance degradation…
Yuyue Wang, Xihua Wang, Xin Cheng, Yijing Chen +1 more
The paper introduces PlanAudio, a unified LLM-based framework that directly synthesizes natural, composite audio containing speech and sounds from unconstrained free-form text prompts, outperforming e…
Yilun Qiu, Xiaoyan Zhao, Yang Zhang, Yuxin Chen +6 more
The paper introduces PARL, a framework that learns personalized evaluation rubrics directly from raw user interaction histories to accurately assess how well LLM outputs align with subjective, user-sp…
The paper introduces AGENTCL, a rigorous evaluation framework that uses controlled task streams to accurately measure an agent's ability to accumulate and reuse knowledge across multiple tasks, thereb…
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.
This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.