Multimodal

Vision-language models, audio-visual learning, and cross-modal reasoning

20 papers indexed

cs.AI, cs.CL, cs.LG · Recent · May 28, 2026

MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization

Anisha Saha, Varsha Suresh, Teodora Kamova, Sophia Wiedmann +2 more

The paper introduces MuPHI, a dataset, and MuPHIRM, a reasoning-augmented training framework, to improve Vision-Language Models' ability to detect and reason about subtle, context-dependent multimodal…

cs.CV, cs.AI · Recent · May 29, 2026

Detect Before You Leap: Mirage Detection in Vision-Language Models

Sayeed Shafayet Chowdhury, Md. Shaown Miah

The paper introduces Text-Conditioned Layer-wise Internal Alignment (TC-LIA), a model-agnostic method that significantly improves the detection of 'mirage'—when Vision-Language Models confidently answ…

cs.CV, cs.LG · Empirical · Recent · Jul 8, 2026

MedPMC: A Systematic Framework for Scaling High-Fidelity Medical Multimodal Data for Foundation Models

Hyunjae Kim, Dain Kim, Pan Xiao, Serina S. Applebaum +24 more

The paper introduces MedPMC, a framework that transforms permissively licensed literature into high-fidelity infrastructure for medical multimodal models, resulting in improved performance on various…

cs.CV · Dataset · Recent · Jul 7, 2026

MonoIR-RS: Infrared Remote Sensing Vision-Language Learning with CLIP and VLM Adaptation

Jiaju Han, Ma Yaqi, Yahui Chai, Xuemeng Sun +7 more

This paper introduces MonoIR-RS, a large-scale infrared remote-sensing vision-language dataset and benchmark for understanding infrared imagery.

cs.CV, cs.AI · Recent · May 28, 2026

Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset

Qian Chen, Xianyin Zhang, Yanzhi Liu, Lifan Guo +2 more

This paper introduces CFMME, a comprehensive Chinese financial multimodal benchmark, and evaluates current Large Vision-Language Models (LVLMs), finding that while state-of-the-art models perform mode…

cs.CV, cs.AI, cs.GR · Recent · May 31, 2026

3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code

Yipeng Gao, Lei Shu, Genzhi Ye, Xi Xiong +4 more

The paper introduces 3DCodeBench, a systematic benchmark and platform for evaluating Vision-Language Model (VLM) agents' ability to generate procedural 3D models from text and images using code.

cs.CV, cs.AI, cs.MA · Recent · May 29, 2026

Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence

Yuhan Wang, Shuochen Chang, Yalin Feng, Dongsheng Ma +7 more

The paper proposes EAGLE, a novel evidence-aligned multi-agent framework, demonstrating that requiring shared visual evidence among agents is crucial for achieving reliable and trustworthy consensus i…

cs.CR, cs.LG · Recent · May 5, 2026

Laundering AI Authority with Adversarial Examples

Jie Zhang, Pura Peetathawatchai, Florian Tramèr, Avital Shafran

The paper demonstrates that adversarial examples can be used to manipulate Vision-Language Models (VLMs) into confidently providing authoritative but incorrect information, a process termed 'AI author…

cs.CV, cs.AI, cs.CL · Recent · Jun 1, 2026

Cross-modal linkage risk in clinical vision-language models

Soroosh Tayebi Arasteh, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn

The paper demonstrates that clinical vision-language models (VLMs) pose a significant privacy risk by allowing de-identified images to be re-linked to original reports, and proposes a targeted differe…

cs.CV, cs.AI, cs.CR · Recent · Apr 12, 2026

Toward Accountable AI-Generated Content on Social Platforms: Steganographic Attribution and Multimodal Harm Detection

Xinlei Guan, David Arosemena, Tejaswi Dhandu, Kuan Huang +6 more

The paper proposes an end-to-end forensic pipeline using steganographic attribution and multimodal harm detection to reliably trace and attribute harmful misuse of AI-generated imagery on social platf…

cs.IR, cs.DL · Empirical · Recent · Jul 22, 2026

Using Hierarchical Controlled Vocabularies to Understand CLIP Retrieval Failures in Historical Photo Collections

Ratan Sebastian, Anett Hoppe, Christoph Rippe, Ralph Ewerth

This paper investigates how the structural properties of controlled vocabularies like the Art and Architecture Thesaurus impact the performance of vision-language models like CLIP for content-based im…

cs.CR, cs.CV · Recent · May 15, 2026

A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation

Hao Yang, Zhuo Ma, Yang Liu, Yilong Yang +2 more

The paper introduces CrossMPI, a novel cross-modal prompt injection attack that uses image-only perturbations to steer the interpretation of both textual and visual inputs in Large Vision-Language Mod…

cs.CV, cs.AI · Recent · May 29, 2026

Variational Adapter for Cross-modal Similarity Representation

WenZhang Wei, Zhipeng Gui, Dehua Peng, Tiandi Ye +1 more

The paper proposes a Variational Adapter (VACSR) to improve cross-modal similarity representation by treating fine-grained image-text matching as a variational inference problem, thereby mitigating th…

cs.CL · Recent · May 29, 2026

"Înţelegi Româneşte?" A Recipe for Romanian Vision-Language Models

Mihai Masala, Marius Leordeanu, Mihai Dascalu, Traian Rebedea

This paper details the systematic construction and training of a high-performing Romanian Vision-Language Model (VLM), demonstrating that language-specific adaptation significantly boosts performance…

cs.CR, cs.AI · Recent · Mar 17, 2026

Adversarial attacks against Modern Vision-Language Models

Alejandro Paredes La Torre

The paper evaluates the adversarial robustness of two open-source Vision-Language Models (LLaVA and Qwen2.5-VL) in a simulated e-commerce environment, finding that while LLaVA is vulnerable to gradien…

cs.CV, cs.AI · Recent · May 28, 2026

Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models

Kai Bian, Xucheng Guo, Bin Chen, Lingyan Ruan +3 more

The paper introduces Pocket-Dentist, an efficiency-aware benchmark and model demonstrating that compact Vision-Language Models (VLMs) can outperform larger models in accuracy while drasti…

cs.CL, cs.AI, cs.CR · Recent · Apr 30, 2026

One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness

Hiroyuki Deguchi, Katsuki Chousa, Yusuke Sakai

The paper proposes a method to identify 'hub texts' that exploit vulnerabilities in cross-modal encoders, demonstrating that a single text can achieve unrealistically high similarity scores across div…

cs.CV, cs.AI · Recent · May 29, 2026

Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models

Zijie Zhou, Dandan Zhu, Hangxiangpan Wang, Heng Zhang +2 more

The paper proposes AsyMoE, a novel Mixture of Experts architecture for Large Vision-Language Models that explicitly models the inherent asymmetry between visual and linguistic modalities, achieving si…

cs.CL, cs.RO · Recent · May 29, 2026

Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely

Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen

The paper evaluates the performance of Vision-Language Models (VLMs) in a collaborative dialogue task requiring spatial reconstruction, finding that while detailed text representations improve results…

cs.CR, cs.AI, cs.LG · NEW · Empirical · Jul 28, 2026

Architectural Backdoors in Vision-Language Model Supply Chains via Representation Steering

Maria Rosaria Briglia, Igor Maljkovic, Antonio Emanuele Cinà, Luca Oneto +2 more

Researchers demonstrate how malicious providers can embed architectural backdoors into Vision-Language Model (VLM) supply chains through representation steering.
