Similar to 2605.31173 · 5 results
Yizhuo Lu, Changde Du, Qiongyi Zhou, Liuyun Jiang +1 more
The paper proposes MindDiffuser, a two-stage framework that significantly improves image reconstruction from brain activity by combining semantic guidance from text-to-image models with structural ref…
ImmersiveTTS is an environment-aware text-to-speech model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross-modal interactions, achieving s…
Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu +3 more
PolySpeech-100 introduces a massive multilingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages…
Yuyue Wang, Xihua Wang, Xin Cheng, Yijing Chen +1 more
The paper introduces PlanAudio, a unified LLM-based framework that directly synthesizes natural, composite audio containing speech and sounds from unconstrained free-form text prompts, outperforming e…
Yuhan Song, Linhao Zhang, Aiwei Liu, Chuhan Wu +5 more
UniAudio-Token is a framework that enhances existing semantic speech tokenizers with general audio perception, allowing them to handle diverse audio types while maintaining high-fidelity speech capabi…