"Vision-Language Models" | ArxivCSExplorer

20 results for “Vision-Language Models”

CS papers only

Hybrid search: keyword + semantic, ranked by combined score.


cs.CL | Recent | May 29, 2026

"Înţelegi Româneşte?" A Recipe for Romanian Vision-Language Models

Mihai Masala, Marius Leordeanu, Mihai Dascalu, Traian Rebedea

This paper details the systematic construction and training of a high-performing Romanian Vision-Language Model (VLM), demonstrating that language-specific adaptation significantly boosts performance…


cs.CV | cs.AI | Recent | May 28, 2026

VLM3: Vision Language Models Are Native 3D Learners

Zhipeng Cai, Zhuang Liu, Yunyang Xiong, Zechun Liu +2 more

The paper proposes VLM3, a simple, scalable method that demonstrates standard Vision Language Models (VLMs) can natively learn 3D understanding by focusing on architectural simplicity and specific dat…


cs.CV | Empirical | Recent | Jul 7, 2026

Vision as Unified Multimodal Generation

Xiaoyang Han, Jianhua Li, Kewang Deng, Zukai Chen +13 more

The paper presents SenseNova-Vision, a unified multimodal model for computer vision tasks using natural language instructions and optional visual prompts, trained primarily on a new corpus and requiri…


cs.CV | cs.AI | Recent | May 29, 2026

Zamba2-VL Technical Report

Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, Beren Millidge

Zamba2-VL is a new suite of vision-language models built on the Zamba2 hybrid architecture, achieving state-of-the-art performance and significantly improved inference efficiency compared to leading T…


cs.CL | cs.AI | Recent | May 30, 2026

MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models

Ravil Mussabayev, Rustam Mussabayev

The paper introduces MLLM-Microscope, a system that analyzes the internal structure of multimodal large language models (MLLMs), finding that modality fusion significantly impacts the linearity and di…


cs.CV | cs.AI | Recent | May 30, 2026

Benchmarks for Vision-Language Models in Urban Perception Should Be Reliability-Aware and Negotiated

Rashid Mushkani

The paper argues that benchmarking Vision-Language Models (VLMs) for urban perception must treat human disagreement and non-response as key measurement outcomes, rather than assuming perfect consensus…


cs.CV | cs.AI | Recent | May 29, 2026

Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models

Zijie Zhou, Dandan Zhu, Hangxiangpan Wang, Heng Zhang +2 more

The paper proposes AsyMoE, a novel Mixture of Experts architecture for Large Vision-Language Models that explicitly models the inherent asymmetry between visual and linguistic modalities, achieving si…


cs.CV | cs.AI | cs.CL | Empirical | Recent | Jul 17, 2026

An Exam for Active Observers

Jiarui Zhang, Muzi Tao, Shangshang Wang, Ollie Liu +2 more

The paper introduces ActiveVision, a benchmark to measure active observation in multimodal large language models, and shows that current models lack robust active visual perception.


cs.CL | cs.RO | Recent | May 29, 2026

Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely

Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen

The paper evaluates the performance of Vision-Language Models (VLMs) in a collaborative dialogue task requiring spatial reconstruction, finding that while detailed text representations improve results…


cs.CV | cs.AI | Recent | May 31, 2026

An Open-Source Benchmark and Baseline for Multi-temporal Referring Segmentation

Bingyu Li, Da Zhang, Tao Huo, Zhiyuan Zhao +2 more

The paper introduces Multi-temporal Referring Segmentation (MTRS), a new task requiring models to segment language-described temporal changes, and proposes MTRefSeg-R1, a specialized framework that ac…


cs.CV | Dataset | Recent | Jul 7, 2026

MonoIR-RS: Infrared Remote Sensing Vision-Language Learning with CLIP and VLM Adaptation

Jiaju Han, Ma Yaqi, Yahui Chai, Xuemeng Sun +7 more

This paper introduces MonoIR-RS, a large-scale infrared remote-sensing vision-language dataset and benchmark for understanding infrared imagery.


cs.CV | cs.HC | Empirical | Recent | Jul 17, 2026

Attention-Guided Saliency Maps for Interpreting Visualization Literacy in VLMs

Maeve Hutchinson, Abderrahmane Wassim Mehdaoui, Pranava Madhyastha

This paper introduces a method for generating diagnostic saliency maps for vision-language models using transformer models, revealing how the models allocate focus across visual elements during answer…


cs.CV | cs.AI | cs.CL | Recent | May 31, 2026

On the Limits of Token Reduction for Efficient Unified Vision Language Training

Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv

The paper analyzes token reduction for efficient unified VLM training, finding that while task-specific acceleration saves computation, it destroys the mutual performance gains achieved through joint…


cs.CV | cs.CL | Empirical | Recent | Jul 22, 2026

Test-Time Training for Modality Order Consistency in Vision-Language Models

Aditi Gupta, Yossi Gandelsman

This paper identifies modality-order sensitivity as a failure in vision-language models and introduces a test-time training method to mitigate it, resulting in improved performance.


cs.RO | Empirical | Recent | Jul 17, 2026

Vision-Language-Motion Maps: An Open-Vocabulary, Uncertainty-Aware, Queryable Motion Attribute for 3D Scene Maps

Dibyendu Ghosh, Ayushi Shakya

This paper introduces Vision-Language-Motion Maps (VLMM), an open-vocabulary, natural-language-queryable 3D map with fused motion attributes and per-element uncertainty, which outperforms semantic-onl…


cs.CV | Recent | Jun 1, 2026

Vision-language Models for Driver Monitoring Systems: A Driver Activity Description Dataset

David J. Lerch, Sarath Mulugurthi, Manuel Martin, Frederik Diederichs +1 more

The paper addresses the difficulty of using general vision-language models (VLMs) for fine-grained driver behavior recognition by creating a new, richly described dataset and demonstrating that fine-t…


cs.CL | cs.AI | cs.DS | Recent | May 29, 2026

Neuro-symbolic Syntactic Parsing: Shaping a Neural Network with the CYK Algorithm

Fabio Massimo Zanzotto, Federico Ranaldi, Giorgio Satta

The paper proposes CYKNN, a novel recurrent neural network architecture that directly encodes the CYK parsing algorithm, demonstrating superior performance over large language models on syntactic pars…


cs.CV | Recent | Jun 4, 2026

PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang +1 more

The paper introduces PAR3D, a unified part-aware 3D-MLLM framework, to enhance 3D scene understanding by enabling models to reason about and ground both whole objects and their fine-grained parts.
