~ similar to 2605.31401· 18 results
This pilot study evaluates curator-guided multilingual art description using a small, on-premise VLM (Qwen2.5-VL-3B-Instruct) for German, Romanian, and Serbian, finding that language-specific adapters…
Zhipeng Cai, Zhuang Liu, Yunyang Xiong, Zechun Liu +2 more
The paper proposes VLM3, a simple, scalable method that demonstrates standard Vision Language Models (VLMs) can natively learn 3D understanding by focusing on architectural simplicity and specific dat…
The paper introduces MLLM-Microscope, a system that analyzes the internal structure of multimodal large language models (MLLMs), finding that modality fusion significantly impacts the linearity and di…
The paper analyzes token reduction for efficient unified VLM training, finding that while task-specific acceleration saves computation, it destroys the mutual performance gains achieved through joint…
This study systematically analyzes strategies for creating reliable multilingual LLMs-as-a-judge, finding that fine-tuning smaller models with in-domain data is effective, while zero-shot evaluation w…
The paper argues that benchmarking Vision-Language Models (VLMs) for urban perception must treat human disagreement and non-response as key measurement outcomes, rather than assuming perfect consensus…
Yusheng He, Jizhe Zhou, Xia Du, Zheng Lin +2 more
This paper systematically analyzes how different architectural components of Large Vision-Language Models (LVLMs) contribute to hallucination robustness, finding that joint enhancement of visual fidel…
Zamba2-VL is a new suite of vision-language models built on the Zamba2 hybrid architecture, achieving state-of-the-art performance and significantly improved inference efficiency compared to leading T…
Zijie Zhou, Dandan Zhu, Hangxiangpan Wang, Heng Zhang +2 more
The paper proposes AsyMoE, a novel Mixture of Experts architecture for Large Vision-Language Models that explicitly models the inherent asymmetry between visual and linguistic modalities, achieving si…
Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao +4 more
The paper introduces a multilingual benchmark (MentalMap) to test if LLMs build internal spatial world models from text, finding a universal 'L3 reasoning cliff' suggesting that text-only working memo…
The paper evaluates the performance of Vision-Language Models (VLMs) in a collaborative dialogue task requiring spatial reconstruction, finding that while detailed text representations improve results…
The paper introduces UA-Legal-Bench, a comprehensive Ukrainian legal reasoning benchmark built from a massive judicial corpus, demonstrating that LLM performance is highly task-dependent and that simp…
The paper introduces Text-Conditioned Layer-wise Internal Alignment (TC-LIA), a model-agnostic method that significantly improves the detection of 'mirage'—when Vision-Language Models confidently answ…
Xiaoqi He, Kaixin Lan, Mu You, Tao Fang +2 more
The paper proposes MACAT, a Multi-Agent Culture-Aware Translation framework, to selectively translate culture-loaded words in ancient Chinese texts, achieving superior performance over existing method…
Marek Šuppa, Andrej Ridzik, Daniel Hládek, Natália Kňažeková +1 more
This paper introduces SkMTEB, a comprehensive text embedding benchmark for Slovak, and develops efficient, locally-deployable Slovak embeddings.
Lu Liu, Huiyu Duan, Chenxin Zhu, Jintong Lu +5 more
The paper introduces LL-Bench, a comprehensive benchmark for evaluating large-scale generative models on low-level vision tasks, and proposes LL-Score, an MLLM-based evaluator that better aligns quali…
The paper proposes an aggressive, parameter-efficient method to prune non-essential experts from Mixture-of-Experts (MoE) LLMs, significantly compressing the model while maintaining high machine trans…
This paper proposes a domain-specialized large language model, PoetryQwen, for precise translation and emotional understanding of classical poetry.