Papers similar to 2606.00275

~ similar to 2606.00275· 19 results

cs.CLcs.CVRecentJun 1, 2026

Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning

Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng +5 more

The paper identifies a failure mode called spatial lexical bias in MLLMs, where adding a spatial word to options biases the model's choice, and demonstrates that this failure originates primarily from…

View →

cs.CVcs.AIRecentMay 29, 2026

What Makes LVLMs Hallucinate Less? Unveiling the Architectural Factors Behind Hallucination Robustness

Yusheng He, Jizhe Zhou, Xia Du, Zheng Lin +2 more

This paper systematically analyzes how different architectural components of Large Vision-Language Models (LVLMs) contribute to hallucination robustness, finding that joint enhancement of visual fidel…

View →

cs.CVcs.AIRecentMay 28, 2026

VLM3: Vision Language Models Are Native 3D Learners

Zhipeng Cai, Zhuang Liu, Yunyang Xiong, Zechun Liu +2 more

The paper proposes VLM3, a simple, scalable method that demonstrates standard Vision Language Models (VLMs) can natively learn 3D understanding by focusing on architectural simplicity and specific dat…

View →

cs.AIRecentMay 27, 2026

Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning

Yang Zhang, Xiaoshuai Sun, Rui Zhao, Wujin Sun +4 more

The paper proposes CSMR, a cognitive scheduling framework that allows a language model to dynamically decide when to acquire task-relevant visual evidence, significantly improving multimodal reasoning…

View →

cs.CLRecentMay 29, 2026

dMoE: dLLMs with Learnable Block Experts

Sicheng Feng, Zigeng Chen, Gongfan Fang, Xinyin Ma +1 more

dMoE proposes a block-level Mixture-of-Experts (MoE) framework for Diffusion Large Language Models (dLLMs) that aggregates token-level expert distributions into a unified block-level distribution, sig…

View →

cs.AIRecentMay 31, 2026

Towards Understanding Modality Interaction in Multimodal Language Models via Partial Information Decomposition

Wanlong Fang, Tianle Zhang, Wen Tao, Alvin Chan

The paper introduces Partial Information Decomposition (PID) to quantitatively separate unique, redundant, and synergistic contributions of different modalities (e.g., vision, language) in multimodal…

View →

cs.CLcs.AIRecentMay 30, 2026

MLLM-Microscope: Unlocking Hidden Structure Within Multimodal Large Language Models

Ravil Mussabayev, Rustam Mussabayev

The paper introduces MLLM-Microscope, a system that analyzes the internal structure of multimodal large language models (MLLMs), finding that modality fusion significantly impacts the linearity and di…

View →

cs.CVcs.AIRecentMay 28, 2026

Semantic and Visual Evidence for Efficient Long-Video Reasoning: A Solution for the HD-EPIC VQA Challenge

Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv +1 more

The paper proposes a unified framework that decouples long-video reasoning into semantic and visual evidence, significantly improving performance on the HD-EPIC VQA Challenge.

View →

cs.AIRecentMay 27, 2026

Do LLMs Build World Models From Text? A Multilingual Diagnostic of Spatial Reasoning

Zhikai Pan, Chih-Ting Liao, Chunrui Liu, Xi Xiao +4 more

The paper introduces a multilingual benchmark (MentalMap) to test if LLMs build internal spatial world models from text, finding a universal 'L3 reasoning cliff' suggesting that text-only working memo…

View →

cs.CVRecentJun 4, 2026

PAR3D: A Unified 3D-MLLM with Part-Aware Representation for Scene Understanding

Shaohui Dai, Yansong Qu, You Shen, Shengchuan Zhang +1 more

The paper introduces PAR3D, a unified part-aware 3D-MLLM framework, to enhance 3D scene understanding by enabling models to reason about and ground both whole objects and their fine-grained parts.

View →

cs.CVcs.AIRecentMay 28, 2026

Mitigating Hallucination in Vision-Language Models through Barrier-Regulated Adaptive Closed-form Steering

Soumyadeep Jana, Pulkit Mittal, Sanasam Ranbir Singh

The paper proposes BRACS, a training-free steering framework that adaptively corrects visual grounding failures in large vision-language models, significantly reducing object hallucination without sac…

View →

cs.CVcs.AIcs.CLRecentMay 31, 2026

On the Limits of Token Reduction for Efficient Unified Vision Language Training

Siyi Chen, Weiming Zhuang, Jingtao Li, Lingjuan Lv

The paper analyzes token reduction for efficient unified VLM training, finding that while task-specific acceleration saves computation, it destroys the mutual performance gains achieved through joint…

View →

cs.CVcs.CLRecentMay 29, 2026

Attend to Evidence: Evidence-Anchored Spatial Attention Supervision for Multimodal RLVR

Ruina Hu, Chen Wang, Lai Wei, Jionghao Bai +4 more

The paper introduces EASE, a method that enhances multimodal Reinforcement Learning with Verifiable Rewards (RLVR) by providing spatial attention supervision anchored to visual evidence, significantly…

View →

cs.CLcs.RORecentMay 29, 2026

Multi-Turn Multi-Agent Dialogue for Collaborative Reconstruction Improves VLM Performance on Spatial Reasoning, But Only Barely

Chalamalasetti Kranti, Sherzod Hakimov, David Schlangen

The paper evaluates the performance of Vision-Language Models (VLMs) in a collaborative dialogue task requiring spatial reconstruction, finding that while detailed text representations improve results…

View →

cs.CLcs.CVRecentMay 30, 2026

Do Text Edits Generalize to Visual Generation? Benchmarking Cross-Modal Knowledge Editing in UMMs

Xin Gao, Cheng Yang, Chufan Shi, Taylor Berg-Kirkpatrick

The paper introduces UniKE, a benchmark showing that successful knowledge edits in text-only multimodal models do not reliably transfer to image generation, revealing a significant modality gap.

View →

cs.CVcs.AIRecentMay 27, 2026

VidPrism: Heterogeneous Mixture of Experts for Image-to-Video Transfer

Rui Lin, Chuanming Wang, Huadong Ma

VidPrism introduces a novel heterogeneous Mixture-of-Experts framework that specializes temporal processing by dividing labor among experts, achieving state-of-the-art performance in image-to-video tr…

View →

cs.CVcs.CLRecentMay 30, 2026

Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding

Hee Suk Yoon, Eunseop Yoon, Jaehyun Jang, SooHwan Eom +5 more

The paper proposes Visual Gradient Steering (VGS), a method that decomposes the distillation loss into language and visual components and steers the optimization to prioritize visual grounding, signif…

View →

cs.CVcs.AIRecentMay 27, 2026

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

Guannan Lv, Ren Nie, Hongjian Dou, Tingting Gao

ROVER is a lightweight, learnable plugin that efficiently routes and integrates object-centric visual evidence across multiple images and objects, significantly improving performance on grounded multi…

View →

cs.CLcs.AIRecentMay 27, 2026

Routing-Aligned Fine-Tuning for Multilingual Downstream Tasks in Mixture-of-Experts Models

Guanzhi Deng, Kuan Wu, Haibo Wang, Shing Yin Wong +2 more

The paper introduces RA-MoE, a novel fine-tuning framework that leverages the internal routing structure of Mixture-of-Experts (MoE) models to improve performance on multilingual downstream tasks by a…

View →