Multimodal
Vision-language models, audio-visual learning, and cross-modal reasoning
20 papers indexed
MuPHI: Learning Implicit Multimodal Harm Reasoning via Semantically Grounded Reward Optimization
Anisha Saha, Varsha Suresh, Teodora Kamova, Sophia Wiedmann +2 more
The paper introduces MuPHI, a dataset, and MuPHIRM, a reasoning-augmented training framework, to improve Vision-Language Models' ability to detect and reason about subtle, context-dependent multimodal…
Detect Before You Leap: Mirage Detection in Vision-Language Models
The paper introduces Text-Conditioned Layer-wise Internal Alignment (TC-LIA), a model-agnostic method that significantly improves the detection of 'mirage'—when Vision-Language Models confidently answ…
Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset
Qian Chen, Xianyin Zhang, Yanzhi Liu, Lifan Guo +2 more
This paper introduces CFMME, a comprehensive Chinese financial multimodal benchmark, and evaluates current Large Vision-Language Models (LVLMs), finding that while state-of-the-art models perform mode…
3DCodeBench: Benchmarking Agentic Procedural 3D Modeling Via Code
Yipeng Gao, Lei Shu, Genzhi Ye, Xi Xiong +4 more
The paper introduces 3DCodeBench, a systematic benchmark and platform for evaluating Vision-Language Model (VLM) agents' ability to generate procedural 3D models from text and images using code.
Laundering AI Authority with Adversarial Examples
The paper demonstrates that adversarial examples can be used to manipulate Vision-Language Models (VLMs) into confidently providing authoritative but incorrect information, a process termed 'AI author…
Seeing Before Agreeing: Aligning Multi-Agent Consensus with Visual Evidence
Yuhan Wang, Shuochen Chang, Yalin Feng, Dongsheng Ma +7 more
The paper proposes EAGLE, a novel evidence-aligned multi-agent framework, demonstrating that requiring shared visual evidence among agents is crucial for achieving reliable and trustworthy consensus i…
Toward Accountable AI-Generated Content on Social Platforms: Steganographic Attribution and Multimodal Harm Detection
Xinlei Guan, David Arosemena, Tejaswi Dhandu, Kuan Huang +6 more
The paper proposes an end-to-end forensic pipeline using steganographic attribution and multimodal harm detection to reliably trace and attribute harmful misuse of AI-generated imagery on social platf…
Cross-modal linkage risk in clinical vision-language models
The paper demonstrates that clinical vision-language models (VLMs) pose a significant privacy risk by allowing de-identified images to be re-linked to original reports, and proposes a targeted differe…
A Cross-Modal Prompt Injection Attack against Large Vision-Language Models with Image-Only Perturbation
Hao Yang, Zhuo Ma, Yang Liu, Yilong Yang +2 more
The paper introduces CrossMPI, a novel cross-modal prompt injection attack that uses image-only perturbations to steer the interpretation of both textual and visual inputs in Large Vision-Language Mod…
Variational Adapter for Cross-modal Similarity Representation
WenZhang Wei, Zhipeng Gui, Dehua Peng, Tiandi Ye +1 more
The paper proposes a Variational Adapter (VACSR) to improve cross-modal similarity representation by treating fine-grained image-text matching as a variational inference problem, thereby mitigating th…
"Înţelegi Româneşte?'' A Recipe for Romanian Vision-Language Models
This paper details the systematic construction and training of a high-performing Romanian Vision-Language Model (VLM), demonstrating that language-specific adaptation significantly boosts performance…
Adversarial attacks against Modern Vision-Language Models
The paper evaluates the adversarial robustness of two open-source Vision-Language Models (LLaVA and Qwen2.5-VL) in a simulated e-commerce environment, finding that while LLaVA is vulnerable to gradien…
One Single Hub Text Breaks CLIP: Identifying Vulnerabilities in Cross-Modal Encoders via Hubness
The paper proposes a method to identify 'hub texts' that exploit vulnerabilities in cross-modal encoders, demonstrating that a single text can achieve unrealistically high similarity scores across div…
Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models
Kai Bian, Xucheng Guo, Bin Chen, Lingyan Ruan +3 more
The paper introduces Pocket-Dentist, an efficiency-aware benchmark and model, and demonstrates that compact Vision-Language Models (VLMs) can outperform larger models in accuracy while drasti…
Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models
Zijie Zhou, Dandan Zhu, Hangxiangpan Wang, Heng Zhang +2 more
The paper proposes AsyMoE, a novel Mixture of Experts architecture for Large Vision-Language Models that explicitly models the inherent asymmetry between visual and linguistic modalities, achieving si…
Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely
The paper evaluates the performance of Vision-Language Models (VLMs) in a collaborative dialogue task requiring spatial reconstruction, finding that while detailed text representations improve results…
AnyMo: Scaling Any-Modality Conditional Motion Generation with Masked Modeling
Yiheng Li, Zhuo Li, Ruibing Hou, Yingjie Chen +3 more
The paper introduces AnyMo, a unified multimodal framework that enables high-quality, scalable conditional human motion generation by leveraging a massive, cross-modal dataset and a masked modeling tr…
Towards Resolving Optimization Conflicts Between Image- and Text-Based Person Re-Identification
The paper proposes a decoupled two-stage training pipeline to effectively learn a shared representation for person re-identification by mitigating optimization conflicts between image-based and text-b…
TIGER: Traceable Inference with Graph-Based Evidence Routing for Mitigating Hallucinations in Multimodal Generation
Kaixiang Zhao, Tianrun Yu, Shawn Huang, Porter Jenkins +2 more
TIGER is an inference-time framework that uses graph-based evidence routing to independently assess and repair unsupported facts (hallucinations) in multimodal generation.
MACReD: A Multi-Agent Collaborative Reasoning Framework for Reaction Diagram Parsing
Chuang Tang, Chenhao Lin, Yin Xu, Hao Wang +4 more
MACReD introduces a hierarchical multi-agent framework that achieves state-of-the-art performance in parsing complex chemical reaction diagrams by coordinating specialized agents for perception and gl…