ArXivCSExplorer
Built by Teycir Ben Soltane

Similar to 2605.29889 · 20 results

cs.CL · cs.AI · cs.CY · Recent · May 31, 2026

Implicit Geographic Inference in LLM Medical Triage: Language-Driven Disparities in Emergency Recommendations

Qi Han Wong

The study demonstrates that LLMs exhibit significant, language-driven disparities in medical triage recommendations, recommending emergency care more frequently for English and Arabic prompts, even wh…

cs.AI · Recent · May 28, 2026

Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation

Kai-Chen Cheng, Haejun Han, David Q. Sun

The paper proposes 'Think Fast, Talk Smart,' a pipeline that separates deterministic data analysis from LLM generation, showing that offloading recurring, structured tasks to code significantly improv…

cs.CL · Recent · Jun 1, 2026

Why Do Self-Harm Prediction Models Struggle to Generalise? Lexical and Semantic Variations in Emergency Department Triage Notes

Liuliu Chen, Mike Conway, Jo Robinson, Vlada Rozova

This paper investigates why self-harm prediction models struggle to generalize across different hospitals, finding that variations in local lexical expression and feature importance are the primary ca…

cs.CL · cs.AI · Recent · May 28, 2026

Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

Mahdi Alkaeed, Adnan Qayyum, Nabeel Abo Kashreef, Muhammad Bilal +1 more

The paper evaluates the semantic stability of clinical LLMs under linguistic variation, finding that domain specialization does not guarantee consistent robustness improvements.

cs.CL · cs.AI · cs.LG · Recent · May 30, 2026

On the Limits of LLM Adaptability: Impact of Model-Internalized Priors on Annotation Task Performance

Etienne Casanova, Rafal Kocielnik, R. Michael Alvarez

The paper demonstrates that LLM performance in zero-shot annotation is significantly limited by the alignment between the model's internal understanding and the task definition, showing that prompt-ba…

cs.CL · cs.AI · Recent · May 28, 2026

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

Valentina Bui Muti, Eugénie Dulout, Ziquan Fu

The paper introduces MedCase-Structured, a synthetic, FHIR-formatted dataset designed to benchmark diagnostic reasoning in realistic EHR settings, showing that LLMs perform worse on structured data th…

cs.CL · Recent · Jun 1, 2026

Towards Multidisciplinary Summarization of Hospital Stays: Efficient Sentence-Level Clinical Provenance Categorization

Baris Karacan, Vaibhav Bhargava, Barbara Di Eugenio, Natalie Parde +20 more

The paper introduces a supervised fine-tuning pipeline using large language models to accurately categorize sentence-level clinical provenance across multi-disciplinary hospital notes, demonstrating t…

cs.CL · cs.AI · cs.LG · Recent · May 28, 2026

Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages

David Rey-Blanco, Roberto Cruz

The authors demonstrate that fine-tuning a two-stage retrieval system using synthetic data generated by large language models can significantly improve the performance of medical semantic search for c…

cs.AI · cs.CL · cs.CY · Recent · May 27, 2026

MIRA: A Bilingual Benchmark for Medical Information Response Audit

Mengyu Xu, Qiaoxin Yang, Qianqian Wang, Xiwei Dai +2 more

The paper introduces MIRA, a bilingual benchmark that reveals that LLMs tend to dilute or omit critical medical information when responding to prompts from users with low health literacy, a pattern te…

cs.CL · Recent · May 31, 2026

HypothesisMed: Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering

Md Motaleb Hossen Manik, Ge Wang

HypothesisMed introduces an inference-time pipeline for biomedical question answering that improves model reliability and structured output generation by fusing multiple model outputs and diagnosing t…

cs.AI · Recent · May 28, 2026

EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

Yuzhang Xie, Keqi Han, Yunpeng Xiao, Hejie Cui +6 more

The paper introduces EHRBench, a large-scale, automated, and reliable benchmark derived from real Electronic Health Records (EHRs) to rigorously evaluate the clinical decision-making capabilities of L…

cs.LG · cs.CL · Recent · May 28, 2026

AMNESIA: A Large Scale Medical Unlearning Benchmark Suite with Disease-Informed Analysis

Saeedeh Davoudi, Reihaneh Iranmanesh, Ophir Frieder, Nazli Goharian

The paper introduces AMNESIA, the first large-scale, open-source benchmark for medical unlearning, demonstrating that current unlearning methods struggle to separate individual patient data from share…

cs.LG · cs.CR · Recent · Apr 29, 2026

Fidelity, Diversity, and Privacy: A Multi-Dimensional LLM Evaluation for Clinical Data Augmentation

Guillermo Iglesias, Gema Bello-Orgaz, María Navas-Loro, Cristian Ramirez-Atencia +2 more

This paper evaluates multiple LLMs (DeepSeek-R1, OpenBioLLM-Llama3, Qwen 3.5) for generating privacy-safe, high-quality synthetic mental health reports, demonstrating their effectiveness in expanding…

cs.CL · Recent · Jun 1, 2026

Encoded but Not Routed: Explaining the Table-Chart Gap in Scientific Claim Verification

Sunisth Kumar, Xanh Ho, Tim Schopf, Andre Greiner-Petter +2 more

The paper explains the 'table-chart gap' in scientific claim verification by showing that multimodal LLMs successfully encode information from charts but fail to route it to the final prediction layer…

stat.OT · cs.AI · Empirical · Recent · Jun 9, 2026

Flaws in the LLM Automation Narrative

George Perrett, Javae Elliott, Jennifer Hill, Marc Scott

This paper evaluates the performance of a Large Language Model (LLM) in a high-stakes context by comparing it to human experts and measuring variance and error magnitude.

cs.CL · cs.AI · Recent · Jun 1, 2026

KliniskVestBERT: BERT Model Specialised to Norwegian Clinical Texts

Christian Autenried, Cosimo Persia

This paper introduces KliniskVestBERT, a suite of BERT models specialized by pre-training on a large, diverse corpus of real-world Norwegian clinical texts, demonstrating superior performance for clin…

cs.AI · cs.CL · cs.ET · Recent · Jun 1, 2026

ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents

Yuxing Lu, Yushuhong Lin, Wenqi Shi, J. Ben Tamo +3 more

The paper introduces ClinEnv, an interactive, multi-stage benchmark designed to evaluate LLMs' decision-making and information-gathering processes during longitudinal inpatient medical simulations.

cs.CL · cs.CV · Recent · Jun 1, 2026

Mechanistic Diagnostics of Spatial Lexical Bias in Multimodal Large Language Model Spatial Reasoning

Chuang Ma, Qianying Liu, Tomoyuki Obuchi, Fei Cheng +5 more

The paper identifies a failure mode called spatial lexical bias in MLLMs, where adding a spatial word to options biases the model's choice, and demonstrates that this failure originates primarily from…

cs.CL · cs.AI · cs.IR · Recent · May 27, 2026

Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG

Yubo Li, Rema Padman, Ramayya Krishnan

This paper introduces a framework to audit source-dependence in multi-source RAG systems, demonstrating that disagreement across institutional sources is a common and critical failure mode that curren…
