Papers similar to 2606.06407

Similar to 2606.06407 · 20 results

cs.CV · cs.AI · cs.CL · Recent · Jun 1, 2026

Cross-modal linkage risk in clinical vision-language models

Soroosh Tayebi Arasteh, Mahshad Lotfinia, Sven Nebelung, Daniel Truhn

The paper demonstrates that clinical vision-language models (VLMs) pose a significant privacy risk by allowing de-identified images to be re-linked to original reports, and proposes a targeted differe…


cs.CV · cs.AI · Recent · May 27, 2026

VITAL: Visual-Semantic Dual Supervision for Enhanced and Interpretable Latent Reasoning in Medical MLLMs

Qiaoru Li, Shaotian Liang, Jintao Chen, Haoran Sun +3 more

VITAL introduces a novel latent-space reasoning framework for medical MLLMs, utilizing visual-semantic dual supervision to enhance reasoning capabilities and provide crucial interpretability without s…


cs.CV · cs.AI · cs.CL · Recent · May 29, 2026

Generating Reports or Repeating Templates? Measuring and Mitigating Template Collapse in 3D CT Report Generation

Tom Maye-Lasserre, Yitong Li, Bailiang Jian, Morteza Ghahremani +2 more

The paper addresses 'Template Collapse' in 3D CT report generation—where models generate generic reports—by proposing CLarGen, a decoupled framework that significantly improves clinical accuracy and d…


cs.CL · Recent · May 31, 2026

Beyond Topical Similarity: Contrastive Evidence Retrieval with Interpretable Attention Alignment in RAG

Francielle Vargas, João Robiatti, Diego Alves, Lucas Pascotti Valem +5 more

The paper introduces CERA, a novel contrastive retrieval framework that improves RAG factuality and interpretability by using subjectivity-based hard negative selection and an auxiliary attention alig…


cs.AI · Recent · May 27, 2026

C-MIG: Multi-view Information Gain-based Retrieval-Augmented Generation for Clinical Diagnosis Reasoning

Yuwei Miao, Gen Li, Yunsheng Zeng, Xiandong Li +7 more

C-MIG is a novel retrieval-augmented generation framework that uses multi-view information gain to improve clinical diagnosis reasoning by providing richer, more nuanced reward signals than existing m…


cs.AI · cs.LG · Recent · Jun 1, 2026

RL-ACRGNet: Reinforcement Learning-Based Chest Radiology Report Generation Network

Yogesh Kumar Meena, Saurabh Agarwal, K. V. Arya

The paper proposes RL-ACRGNet, an improved encoder-decoder model that uses reinforcement learning to generate high-quality, clinically coherent chest radiology reports, significantly outperforming exi…


cs.CV · cs.AI · cs.LG · Recent · May 28, 2026

CardioLens: Revealing the Clinical Reality Gap of MLLMs via Multi-Sequence Cardiac MRI Evaluations

Zixian Su, Hongkai Zhang, Fan Gao, Encheng Su +11 more

The paper introduces CardioLens, a rigorous evaluation testbed for multi-sequence Cardiac MRI, which reveals that current Multimodal Large Language Models (MLLMs) exhibit a significant 'clinical reali…


cs.CV · cs.CL · Recent · May 31, 2026

Reasmory: 3D Reconstruction as Explicit Memory for VLMs Spatial Reasoning

Jixuan He, Xueting Li, Chieh Hubert Lin, Ming-Hsuan Yang

Reasmory introduces a structured programming framework that uses explicit 3D memory and a Domain-Specific Language (DSL) to reliably enhance Vision-Language Models' spatial reasoning capabilities, ach…


cs.CL · cs.AI · cs.LG · Recent · May 28, 2026

Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages

David Rey-Blanco, Roberto Cruz

The authors demonstrate that fine-tuning a two-stage retrieval system using synthetic data generated by large language models can significantly improve the performance of medical semantic search for c…


cs.CV · cs.AI · Recent · May 29, 2026

Simple Token-Efficient Vision-Language Model for Case-level Pathology Synoptic Report Generation

Zhiyuan Yang, Jiahao Cheng, Vincent Quoc-Huy Trinh, Mahdi S. Hosseini

The paper introduces a simple, token-efficient vision-language model for generating comprehensive pathology synoptic reports from multiple whole-slide images (WSIs), achieving high performance while s…


cs.CV · cs.AI · Recent · May 27, 2026

ROVER: Routing Object-Centric Visual Evidence for Grounded Multi-Image Reasoning

Guannan Lv, Ren Nie, Hongjian Dou, Tingting Gao

ROVER is a lightweight, learnable plugin that efficiently routes and integrates object-centric visual evidence across multiple images and objects, significantly improving performance on grounded multi…


cs.CV · cs.AI · Recent · May 29, 2026

StemBind: When MLLMs Get Lost Between Rules and Instances in Abstract Visual Reasoning

Xixiang He, Baiqi Wu, Xingming Li, Ao Cheng +3 more

The paper introduces StemBind, a diagnostic benchmark that separates perception, rule induction, and answer selection in abstract visual reasoning, revealing that the primary failure point for MLLMs i…


cs.CL · Recent · May 31, 2026

PMC-InterCPT: Rethinking Biomedical Interleaved Data for Multimodal Continued Pretraining

Guanghao Zhu, Zeyu Liu, Zhitian Hou, Pengkai Wang +8 more

The paper introduces PMC-InterCPT, a refined biomedical interleaved corpus that enhances multimodal continued pretraining by integrating figure-referencing body text alongside captions, leading to imp…


cs.CV · Recent · Jun 1, 2026

Training-Free Composed Video Retrieval via Visual Representation-Guided Video-LLM Reasoning

Yang Liu, Qianqian Xu, Peisong Wen, Siran Dai +1 more

The paper proposes a training-free framework, Visual Representation-Guided Video-LLM Reasoning, to perform composed video retrieval by using visual examples and text instructions, achieving strong per…


cs.AI · Recent · Jun 1, 2026

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

Junqi Liu, Salena Song, Yuhan Wang, Jiawei Mao +11 more

The paper introduces AutoMedBench, a novel workflow-aware benchmark that evaluates autonomous medical-AI agents across a five-stage research process, revealing that agents struggle most with validatio…


cs.CV · cs.AI · q-bio.NC · Recent · May 28, 2026

Brain-IT-VQA: From Brain Signals to Answers

Roman Beliy, Matias Cosarinsky, Oliver Heinimann, Navve Wasserman +1 more

The paper introduces Brain-IT-VQA, a novel framework that significantly improves visual question answering from fMRI signals, and presents NSD-VQA, a new, highly controlled dataset for this task.


cs.AI · Recent · May 30, 2026

SDR: Set-Distance Rewards for Radiology Report Generation

Halil Ibrahim Gulluk, Max Van Puyvelde, Wim Van Criekinge, Olivier Gevaert

The paper introduces Set-Distance Rewards (SDR), a permutation-invariant reward signal that effectively guides the generation of unordered radiology reports, significantly outperforming standard train…


cs.CV · cs.AI · Recent · May 31, 2026

Beyond Visual Memory: Mechanistic Diagnostics of Latent Visual Reasoning

Garvin Guo, Yu Chen, Xiang Wang, Shuai Li +3 more

The paper deconstructs latent visual reasoning tokens into components and finds that the performance gains are primarily due to boundary markers and attention patterns, not the tokens' ability to enco…


cs.CV · cs.AI · Recent · May 28, 2026

Pocket-Dentist: On-Device Dental Image Understanding via Efficient Multimodal Large Language Models

Kai Bian, Xucheng Guo, Bin Chen, Lingyan Ruan +3 more

The paper introduces Pocket-Dentist, an efficiency-aware benchmark and model demonstrating that compact Vision-Language Models (VLMs) can outperform larger models in accuracy while drasti…


cs.AI · Recent · May 27, 2026

Look on Demand: A Cognitive Scheduling Framework for Visual Evidence Acquisition in Multimodal Reasoning

Yang Zhang, Xiaoshuai Sun, Rui Zhao, Wujin Sun +4 more

The paper proposes CSMR, a cognitive scheduling framework that allows a language model to dynamically decide when to acquire task-relevant visual evidence, significantly improving multimodal reasoning…
