Papers similar to 2606.00909

~ similar to 2606.00909· 17 results

cs.CRcs.AIRecentMar 30, 2026

Adversarial Attacks on Multimodal Large Language Models: A Comprehensive Survey

Bhavuk Jain, Sercan Ö. Arık, Hardeo K. Thakur

This survey provides a comprehensive taxonomy and vulnerability-centric analysis of adversarial attacks targeting Multimodal Large Language Models (MLLMs), offering an explanatory framework for enhanc…

View →

cs.CVcs.AIRecentMay 28, 2026

VLM3: Vision Language Models Are Native 3D Learners

Zhipeng Cai, Zhuang Liu, Yunyang Xiong, Zechun Liu +2 more

The paper proposes VLM3, a simple, scalable method that demonstrates standard Vision Language Models (VLMs) can natively learn 3D understanding by focusing on architectural simplicity and specific dat…

View →

cs.CVcs.AIcs.CLRecentMay 31, 2026

On the Limits of Token Reduction for Efficient Unified Vision Language Training

Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv

The paper analyzes token reduction for efficient unified VLM training, finding that while task-specific acceleration saves computation, it destroys the mutual performance gains achieved through joint…

View →

cs.CLcs.AIcs.LGRecentJun 1, 2026

Multilinguality of Large Language Models From a Structural Perspective

Haruki Sakajo, Yusuke Sakai, Hidetaka Kamigaito, Taro Watanabe

This paper analyzes the multilinguality of LLMs by examining their structural properties, finding that low-resource languages are structurally more distinct from English than high-resource languages,…

View →

cs.LGcs.AIcs.CLRecentMay 27, 2026

Parallax: Parameterized Local Linear Attention for Language Modeling

Yifei Zuo, Dhruv Pai, Zhichen Zeng, Alec Dewulf +2 more

The paper introduces Parallax, a scalable and numerically stable parameterized Local Linear Attention mechanism that significantly improves LLM performance and efficiency compared to existing methods…

View →

cs.CVRecentJun 4, 2026

PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang +1 more

The paper introduces PAR3D, a unified part-aware 3D-MLLM framework, to enhance 3D scene understanding by enabling models to reason about and ground both whole objects and their fine-grained parts.

View →

cs.CLcs.CVRecentJun 1, 2026

Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning

Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng +5 more

The paper identifies a failure mode called spatial lexical bias in MLLMs, where adding a spatial word to options biases the model's choice, and demonstrates that this failure originates primarily from…

View →

cs.CVcs.AIRecentMay 29, 2026

What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness

Yusheng He, Jizhe Zhou, Xia Du, Zheng Lin +2 more

This paper systematically analyzes how different architectural components of Large Vision-Language Models (LVLMs) contribute to hallucination robustness, finding that joint enhancement of visual fidel…

View →

cs.AIcs.CLcs.LGRecentMay 27, 2026

Explaining is Harder Than Predicting Alone: Evaluating Concept-based Explanations of MLLMs as ICL Visual Classifiers

Carmen Quiles-Ramírez, Leticia L. Rodríguez, Nicolás Martorell, Natalia Díaz-Rodríguez

The paper systematically evaluates concept-based explainability in MLLMs, finding that forcing models to generate formal explanations degrades predictive accuracy, suggesting that explaining is genuin…

View →

cs.CVcs.AIcs.CRRecentApr 10, 2026

Leave My Images Alone: Preventing Multi-Modal Large Language Models from Analyzing Images via Visual Prompt Injection

Zedian Shao, Hongbin Liu, Yuepeng Hu, Neil Zhenqiang Gong

The paper introduces ImageProtector, a user-side method that embeds an imperceptible perturbation into images to prevent Multi-modal Large Language Models (MLLMs) from analyzing and extracting sensiti…

View →

cs.CVcs.AIRecentMay 29, 2026

Zamba2-VL Technical Report

Hassan Shapourian, Kasra Hejazi, Olabode M. Sule, Beren Millidge

Zamba2-VL is a new suite of vision-language models built on the Zamba2 hybrid architecture, achieving state-of-the-art performance and significantly improved inference efficiency compared to leading T…

View →

cs.MMcs.AIcs.CLRecentMay 29, 2026

A Pilot Study on Curator-Guided Multilingual Art Description for Blind and Low-Vision Audiences with Small Vision-Language Models

Iosif Tsangko, Andreas Triantafyllopoulos, George Margetis, Ioana Crihana +1 more

This pilot study evaluates curator-guided multilingual art description using a small, on-premise VLM (Qwen2.5-VL-3B-Instruct) for German, Romanian, and Serbian, finding that language-specific adapters…

View →

cs.CLcs.AIRecentMay 27, 2026

PromptEmbedder:: Efficient and Transferable Text Embedding via Dual-LLM Soft Prompting

Yu-Che Tsai, Kuan-Yu Chen, Yuan-Hao Chen, Yu-Han Chang +3 more

PromptEmbedder introduces a dual-LLM framework that efficiently and transferably adapts text embeddings by decoupling task-specific knowledge from the backbone model, significantly reducing computatio…

View →

cs.LGcs.CLRecentMay 28, 2026

Measuring, Localizing, and Ablating Alignment Signatures in LLMs

Aniket Anand, Janvijay Singh, Zhewei Sun, Dilek Hakkani-Tür +1 more

The paper demonstrates that the AI-like style introduced by post-training alignment can be measured, localized, and causally removed using a novel ablation technique called PASTA.

View →

cs.CVcs.AIRecentMay 28, 2026

Benchmarking Large Vision-Language Models on CFMME: A Comprehensive Chinese Financial Multimodal Evaluation Dataset

Qian Chen, Xianyin Zhang, Yanzhi Liu, Lifan Guo +2 more

This paper introduces CFMME, a comprehensive Chinese financial multimodal benchmark, and evaluates current Large Vision-Language Models (LVLMs), finding that while state-of-the-art models perform mode…

View →

cs.CLcs.AIeess.ASRecentMay 31, 2026

PolySpeech-100: A Large-Scale Benchmark for Speech Understanding Across 100+ Languages and Dialects

Sicheng Yang, Shulan Ruan, Shiwei Wu, Yu Liu +3 more

PolySpeech-100 introduces a massive, multi-lingual benchmark covering 110 linguistic variants to rigorously test Speech-LLMs, demonstrating that open-source models struggle with low-resource languages…

View →

cs.CLRecentMay 29, 2026

"Înţelegi Româneşte?'' A Recipe for Romanian Vision-Language Models

Mihai Masala, Marius Leordeanu, Mihai Dascalu, Traian Rebedea

This paper details the systematic construction and training of a high-performing Romanian Vision-Language Model (VLM), demonstrating that language-specific adaptation significantly boosts performance…

View →