ArXivCSExplorer
Built by Teycir Ben Soltane

Similar to 2605.31173 · 5 results

cs.CV · cs.AI · Recent · May 28, 2026

Versatile Framework with Semantic and Structural guidance for Image Reconstruction from Brain Activity

Yizhuo Lu, Changde Du, Qiongyi Zhou, Liuyun Jiang +1 more

The paper proposes MindDiffuser, a two-stage framework that significantly improves image reconstruction from brain activity by combining semantic guidance from text-to-image models with structural ref…

eess.AS · cs.AI · cs.CL · Recent · May 29, 2026

ImmersiveTTS: Environment-Aware Text-to-Speech with Multimodal Diffusion Transformer and Domain-Specific Representation Alignment

Jun-Hak Yun, Seung-Bin Kim, Seong-Whan Lee

ImmersiveTTS is an environment-aware text-to-speech model that generates natural speech seamlessly integrated within environmental contexts by explicitly modeling cross-modal interactions, achieving s…

cs.CL · cs.AI · eess.AS · Recent · May 31, 2026

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu +3 more

PolySpeech-100 introduces a massive, multi-lingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages…

cs.SD · cs.AI · cs.MM · Recent · May 27, 2026

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

Yuyue Wang, Xihua Wang, Xin Cheng, Yijing Chen +1 more

The paper introduces PlanAudio, a unified LLM-based framework that directly synthesizes natural, composite audio containing speech and sounds from unconstrained free-form text prompts, outperforming e…

cs.CL · cs.SD · Recent · May 29, 2026

UniAudio-Token: Empowering Semantic Speech Tokenizers with General Audio Perception

Yuhan Song, Linhao Zhang, Aiwei Liu, Chuhan Wu +5 more

UniAudio-Token is a framework that enhances existing semantic speech tokenizers with general audio perception, allowing them to handle diverse audio types while maintaining high-fidelity speech capabi…
