20 results for “Vision-Language Models”
CS papers onlyHybrid search: Keyword + semantic, ranked by combined score.ⓘ
Want pure semantic search? Try claim verification →
This paper details the systematic construction and training of a high-performing Romanian Vision-Language Model (VLM), demonstrating that language-specific adaptation significantly boosts performance…
Zhipeng Cai, Zhuang Liu, Yunyang Xiong, Zechun Liu +2 more
The paper proposes VLM3, a simple, scalable method that demonstrates standard Vision Language Models (VLMs) can natively learn 3D understanding by focusing on architectural simplicity and specific dat…
Zamba2-VL is a new suite of vision-language models built on the Zamba2 hybrid architecture, achieving state-of-the-art performance and significantly improved inference efficiency compared to leading T…
The paper introduces MLLM-Microscope, a system that analyzes the internal structure of multimodal large language models (MLLMs), finding that modality fusion significantly impacts the linearity and di…
The paper argues that benchmarking Vision-Language Models (VLMs) for urban perception must treat human disagreement and non-response as key measurement outcomes, rather than assuming perfect consensus…
Zijie Zhou, Dandan Zhu, Hangxiangpan Wang, Heng Zhang +2 more
The paper proposes AsyMoE, a novel Mixture of Experts architecture for Large Vision-Language Models that explicitly models the inherent asymmetry between visual and linguistic modalities, achieving si…
The paper evaluates the performance of Vision-Language Models (VLMs) in a collaborative dialogue task requiring spatial reconstruction, finding that while detailed text representations improve results…
Bingyu Li, Da Zhang, Tao Huo, Zhiyuan Zhao +2 more
The paper introduces Multi-temporal Referring Segmentation (MTRS), a new task requiring models to segment language-described temporal changes, and proposes MTRefSeg-R1, a specialized framework that ac…
The paper analyzes token reduction for efficient unified VLM training, finding that while task-specific acceleration saves computation, it destroys the mutual performance gains achieved through joint…
The paper addresses the difficulty of using general vision-language models (VLMs) for fine-grained driver behavior recognition by creating a new, richly described dataset and demonstrating that fine-t…
The paper proposes CYKNN, a novel recurrent neural network architecture that directly encodes the CYK parsing algorithm, demonstrating superior performance over large language models on syntactic pars…
Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang +1 more
The paper introduces PAR3D, a unified part-aware 3D-MLLM framework, to enhance 3D scene understanding by enabling models to reason about and ground both whole objects and their fine-grained parts.
Yuhan Wang, Shuochen Chang, Yalin Feng, Dongsheng Ma +7 more
The paper proposes EAGLE, a novel evidence-aligned multi-agent framework, demonstrating that requiring shared visual evidence among agents is crucial for achieving reliable and trustworthy consensus i…
This pilot study evaluates curator-guided multilingual art description using a small, on-premise VLM (Qwen2.5-VL-3B-Instruct) for German, Romanian, and Serbian, finding that language-specific adapters…
The paper proposes Continuous Reasoning for Vision-Language-Action (VLA) models, arguing that effective reasoning must be a shared, verifiable internal latent space rather than discrete text tokens, l…
Yifei Zuo, Dhruv Pai, Zhichen Zeng, Alec Dewulf +2 more
The paper introduces Parallax, a scalable and numerically stable parameterized Local Linear Attention mechanism that significantly improves LLM performance and efficiency compared to existing methods…
The paper introduces GPIC, a massive, permissively licensed, and safety-filtered image corpus of 28 trillion pixels, designed to serve as a stable and accessible benchmark for large-scale visual gener…
Qian Chen, Xianyin Zhang, Yanzhi Liu, Lifan Guo +2 more
This paper introduces CFMME, a comprehensive Chinese financial multimodal benchmark, and evaluates current Large Vision-Language Models (LVLMs), finding that while state-of-the-art models perform mode…
Xiaona Zhou, Muntasir Wahed, Tianjiao Yu, Constantin Brif +1 more
The paper introduces VisAnomReasoner, a parameter-efficient Vision-Language Model (VLM), trained on a new benchmark (VisAnomBench) to accurately and interpretably detect anomalies in time-series data.
Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom +5 more
The paper proposes Visual Gradient Steering (VGS), a method that decomposes the distillation loss into language and visual components and steers the optimization to prioritize visual grounding, signif…