Papers similar to 2606.01016

Similar to 2606.01016 · 7 results

cs.CL · cs.AI · Recent · May 27, 2026

KVoiceBench, KOpenAudioBench, and KMMAU: Agent-Driven Korean Speech Benchmarks for Evaluating SpeechLMs

Haechan Kim, Seungjun Chung, Inkyu Park, Jihoo Lee +1 more

The paper introduces three new Korean speech benchmarks (KVoiceBench, KOpenAudioBench, and KMMAU) to evaluate SpeechLMs, demonstrating that English-centric evaluation fails to capture performance gaps…

eess.AS · cs.AI · cs.SD · Recent · May 29, 2026

A Unified and Reproducible Experimentation Framework for Speech Understanding

Jing Peng, Junhao Du, Chenghao Wang, Hanqi Li +20 more

The paper introduces SURE, a unified framework designed to standardize and improve the comparability and reproducibility of evaluations for advanced speech understanding models.

cs.CL · cs.CY · cs.HC · Recent · Jun 1, 2026

WAXAL-NET: Finetuned Edge ASR Across 19 African Languages

Victor Tolulope Olufemi, Oreoluwa Babatunde, Ramsey Njema, Bolarinwa Gbotemi +27 more

This paper demonstrates that compact, domain-specialized Automatic Speech Recognition (ASR) models significantly outperform large, general-purpose foundation models for conversational speech across 19…

cs.CL · Recent · Jun 1, 2026

SN-WER: Script-Normalized WER for Multi-Script Indic ASR Evaluation

Priyaranjan Pattnayak

The paper introduces Script-Normalized WER (SN-WER), a novel evaluation metric that transliterates ASR transcripts into a canonical script to accurately measure speech recognition performance across d…

cs.SD · cs.AI · eess.AS · Recent · May 28, 2026

HoliTok: A Continuous Holistic Tokenization with Robust Dual Capabilities of Speech Generation and Understanding

Bohan Li, Shi Lian, Hankun Wang, Yiwei Guo +5 more

HoliTok introduces a novel continuous holistic tokenization model that provides a unified, high-fidelity latent representation for simultaneously supporting both speech generation and speech understan…

cs.SD · cs.AI · Recent · Jun 1, 2026

MOSS-Audio Technical Report

Chen Yang, Chufan Yu, Hanfu Chen, Jie Zhu +21 more

MOSS-Audio is a unified audio-language model designed for comprehensive understanding of speech, environmental sounds, and music, achieving strong performance across various audio-grounded tasks.

cs.CL · cs.SD · Recent · May 29, 2026

UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

Yuhan Song, Linhao Zhang, Aiwei Liu, Chuhan Wu +5 more

UniAudio-Token is a framework that enhances existing semantic speech tokenizers with general audio perception, allowing them to handle diverse audio types while maintaining high-fidelity speech capabi…