Similar to 2605.27984 · 8 results
Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu +3 more
PolySpeech-100 introduces a massive multilingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages…
Jing Peng, Junhao Du, Chenghao Wang, Hanqi Li +20 more
The paper introduces SURE, a unified framework designed to standardize and improve the comparability and reproducibility of evaluations for advanced speech understanding models.
Yanjie An, Yuxiang Zhao, Yichi Zhang, Qixi Zheng +4 more
The paper introduces OpenSTBench, a unified, multidimensional evaluation framework designed to comprehensively compare heterogeneous speech translation systems by jointly assessing translation, speech…
Zixuan Jiang, Yanqiao Zhu, Peng Wang, Qinyuan Chen +7 more
The paper proposes Agentic ASR, a closed-loop framework that treats ASR as a multi-turn refinement task, significantly improving semantic accuracy over traditional token-level metrics.
The paper introduces a diagnostic framework to decompose multilingual LLM performance variance, showing that language identity and model-benchmark interactions are key drivers of performance gaps.
Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more
The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…
The paper introduces Script-Normalized WER (SN-WER), a novel evaluation metric that transliterates ASR transcripts into a canonical script to accurately measure speech recognition performance across d…
Nahyun Lee, Dongkeun Yoon, Guijin Son, Geewook Kim +11 more
The paper introduces K-BrowseComp, a new web-browsing agent benchmark of 400 problems grounded in Korean contexts, demonstrating that current frontier LLMs struggle significantly with complex, context…