~ similar to 2605.28422· 20 results
The paper systematically evaluates concept-based explainability in MLLMs, finding that forcing models to generate formal explanations degrades predictive accuracy, suggesting that explaining is genuin…
Tengfei Zhang, Ziheng Zhao, Lisong Dai, Xiaoman Zhang +4 more
This paper introduces MedReCo and MedReCo-VLM, a framework that enables entity-aware cross-image reasoning for medical imaging, allowing AI to compare current scans with prior studies and analogous ca…
Xixiang He, Baiqi Wu, Xingming Li, Ao Cheng +3 more
The paper introduces StemBind, a diagnostic benchmark that separates perception, rule induction, and answer selection in abstract visual reasoning, revealing that the primary failure point for MLLMs i…
Garvin Guo, Yu Chen, Xiang Wang, Shuai Li +3 more
The paper deconstructs latent visual reasoning tokens into components and finds that the performance gains are primarily due to boundary markers and attention patterns, not the tokens' ability to enco…
Zixian Su, Hongkai Zhang, Fan Gao, Encheng Su +11 more
The paper introduces CardioLens, a rigorous evaluation testbed for multi-sequence Cardiac MRI, which reveals that current Multimodal Large Language Models (MLLMs) exhibit a significant 'clinical reali…
Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng +5 more
The paper identifies a failure mode called spatial lexical bias in MLLMs, where adding a spatial word to options biases the model's choice, and demonstrates that this failure originates primarily from…
Shuochen Chang, Tong Bai, Xiaofeng Zhang, Qianli Ma +4 more
This paper introduces interpretability-guided, training-free interventions that systematically improve the accuracy and controllability of latent reasoning in LLMs by leveraging structural and causal…
Mahtab Bigverdi, Lindsey Li, Weikai Huang, Yiming Liu +7 more
This paper introduces Imaginative Perception Tokens (IPT) to improve spatial reasoning in vision language models.
The paper introduces Responsible Contrastive Soft Prompting (RCSP), a parameter-efficient method using soft prompts to improve LLM reliability by simultaneously suppressing hallucinations, encouraging…
Xiaolin Liu, Yilun Zhu, Xiangyu Zhao, Xuehui Wang +8 more
The paper introduces Moment-Video, a new benchmark that diagnoses the ability of video MLLMs to understand brief, critical visual events, revealing that current models struggle significantly with temp…
Ruina Hu, Chen Wang, Lai Wei, Jionghao Bai +4 more
The paper introduces EASE, a method that enhances multimodal Reinforcement Learning with Verifiable Rewards (RLVR) by providing spatial attention supervision anchored to visual evidence, significantly…
The paper introduces CERA, a novel contrastive retrieval framework that improves RAG factuality and interpretability by using subjectivity-based hard negative selection and an auxiliary attention alig…
Yang Zhang, Xiaoshuai Sun, Rui Zhao, Wujin Sun +4 more
The paper proposes CSMR, a cognitive scheduling framework that allows a language model to dynamically decide when to acquire task-relevant visual evidence, significantly improving multimodal reasoning…
The paper introduces MLLM-Microscope, a system that analyzes the internal structure of multimodal large language models (MLLMs), finding that modality fusion significantly impacts the linearity and di…
This paper analyzes failure modes in collaborative visual reasoning systems, demonstrating that naive shared workspaces can amplify hallucinations and proposing diagnostics for improving communication…
ROVER is a lightweight, learnable plugin that efficiently routes and integrates object-centric visual evidence across multiple images and objects, significantly improving performance on grounded multi…
Yinsong Xu, Wei Jing, Liuxin Zhang, Wanjun Lv +1 more
The paper proposes a unified framework that decouples long-video reasoning into semantic and visual evidence, significantly improving performance on the HD-EPIC VQA Challenge.
Tianze Yang, Yucheng Shi, Ruitong Sun, Jingyuan Huang +2 more
The paper introduces TRON, an online, rule-verifiable environment substrate that generates an unbounded stream of fresh, controllable visual reasoning training instances, significantly improving RL pe…
Ye Leng, Junjie Chu, Mingjie Li, Chenhao Lin +4 more
The paper analyzes that while multimodal large language models (MLLMs) offer superior semantic understanding for image generation, this enhanced capability significantly increases safety risks, partic…
Yuhan Wang, Shuochen Chang, Yalin Feng, Dongsheng Ma +7 more
The paper proposes EAGLE, a novel evidence-aligned multi-agent framework, demonstrating that requiring shared visual evidence among agents is crucial for achieving reliable and trustworthy consensus i…