Papers similar to 2605.28025

Similar to 2605.28025 · 20 results

cs.AI · Recent · May 28, 2026

Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation

Kai-Chen Cheng, Haejun Han, David Q. Sun

The paper proposes 'Think Fast, Talk Smart,' a pipeline that separates deterministic data analysis from LLM generation, showing that offloading recurring, structured tasks to code significantly improv…

cs.CL, cs.AI, cs.IR · Recent · May 27, 2026

Same Question, Different Source, Different Answer: Auditing Source-Dependence in Medical Multi-Source RAG

Yubo Li, Rema Padman, Ramayya Krishnan

This paper introduces a framework to audit source-dependence in multi-source RAG systems, demonstrating that disagreement across institutional sources is a common and critical failure mode that curren…

cs.AI · Recent · May 27, 2026

Better Accuracies, Worse Reasoning: A Step-Level Audit of Medical Chain-of-Thought Distillation

Zhaoyang Jiang, Xuanqi Peng, Fei Teng, Zhizhong Fu +4 more

The paper demonstrates that while distilling large language models for medical QA can significantly improve final answer accuracy, this gain often comes at the cost of factual accuracy and detailed re…

cs.HC, cs.AI, cs.CL · Recent · May 28, 2026

LLUMI: Improving LLM Writing Assistance for Mental Health Support with Online Community Feedback

Jiwon Kim, Maya Ajit, Sherry Gong, Soorya Ram Shimgekar +3 more

The paper introduces LLUMI, an open-source framework that improves LLM writing assistance for mental health support using community feedback, demonstrating comparable performance to proprietary models…

cs.CL, cs.AI, cs.CY · Recent · May 31, 2026

Implicit Geographic Inference in LLM Medical Triage: Language-Driven Disparities in Emergency Recommendations

Qi Han Wong

The study demonstrates that LLMs exhibit significant, language-driven disparities in medical triage recommendations, recommending emergency care more frequently for English and Arabic prompts, even wh…

cs.CL · Recent · May 31, 2026

DrugClaw and DrugAudit: A Primary-Source-Grounded Agent and Authority-Aware Benchmark for Drug-Information Question Answering

Qing Wang, Bo Li, Jialu Liang, Daling Shi +2 more

The paper introduces DrugClaw, a multi-agent system, and DrugAudit, a new benchmark, demonstrating that DrugClaw excels at answering drug-related questions by grounding answers in primary regulatory s…

cs.CL, cs.AI · Recent · May 28, 2026

Same Patient, Different Words, Different Diagnosis? Evaluating Semantic Stability in Clinical LLMs

Mahdi Alkaeed, Adnan Qayyum, Nabeel Abo Kashreef, Muhammad Bilal +1 more

The paper evaluates the semantic stability of clinical LLMs to linguistic variations, finding that domain specialization does not guarantee consistent robustness improvements.

cs.AI · Recent · May 27, 2026

SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models

Chao Ding, Mouxiao Bian, Tianbin Li, Minjia Yuan +11 more

The paper introduces SafeMed-R1, a clinically audited LLM that significantly improves safety and ethical alignment for medical applications, matching or exceeding resident performance on safety-critic…

cs.CL · Recent · Jun 1, 2026

Not What, But How: A Communicative Audit of LLM Response Framing

Siddhesh Milind Pawar, Sarah Masud, Haneul Yoo, Alice Oh +1 more

The paper introduces FRANZ, a communicative audit framework, to evaluate how LLMs frame responses to subjective questions, finding that LLMs exhibit statistically significant and coupled differences i…

cs.CL, cs.AI · Recent · May 28, 2026

MedCase-Structured: A Text-to-FHIR Dataset for Benchmarking Diagnostic Reasoning in Clinically Realistic EHR Settings

Valentina Bui Muti, Eugénie Dulout, Ziquan Fu

The paper introduces MedCase-Structured, a synthetic, FHIR-formatted dataset designed to benchmark diagnostic reasoning in realistic EHR settings, showing that LLMs perform worse on structured data th…

cs.IR, cs.CL · Recent · May 29, 2026

Evaluating Factual Density in Multi-Source RAG: A Study in Medical AI Accuracy

Michael R. DeMarco

The paper introduces Factual Density (FD*), a novel retrieval signal that measures the proportion of verified facts, demonstrating that optimizing RAG retrieval based on this density significantly imp…

cs.LG, cs.CL · Recent · May 28, 2026

AMNESIA: A Large Scale Medical Unlearning Benchmark Suite with Disease-Informed Analysis

Saeedeh Davoudi, Reihaneh Iranmanesh, Ophir Frieder, Nazli Goharian

The paper introduces AMNESIA, the first large-scale, open-source benchmark for medical unlearning, demonstrating that current unlearning methods struggle to separate individual patient data from share…

cs.AI, cs.CL, cs.ET · Recent · Jun 1, 2026

ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents

Yuxing Lu, Yushuhong Lin, Wenqi Shi, J. Ben Tamo +3 more

The paper introduces ClinEnv, a novel interactive, multi-stage benchmark designed to evaluate LLMs' decision-making and information-gathering process during longitudinal inpatient medical simulations.

cs.CL · Recent · May 31, 2026

HypothesisMed: Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering

Md Motaleb Hossen Manik, Ge Wang

HypothesisMed introduces an inference-time pipeline for biomedical question answering that improves model reliability and structured output generation by fusing multiple model outputs and diagnosing t…

cs.LG, cs.CR · Recent · Apr 29, 2026

Fidelity, Diversity, and Privacy: A Multi-Dimensional LLM Evaluation for Clinical Data Augmentation

Guillermo Iglesias, Gema Bello-Orgaz, María Navas-Loro, Cristian Ramirez-Atencia +2 more

This paper evaluates multiple LLMs (DeepSeek-R1, OpenBioLLM-Llama3, Qwen 3.5) for generating privacy-safe, high-quality synthetic mental health reports, demonstrating their effectiveness in expanding…

cs.AI · Recent · May 29, 2026

LLM-FACETS: A Privacy-Preserving Framework for Evaluating LLM Transparency and Accountability

Tom Lucas, Alessio Buscemi, Alfredo Capozucca, German Castignani +1 more

LLM-FACETS introduces an open-source, privacy-preserving framework designed to enable non-technical domain experts and compliance officers to audit and evaluate the transparency and accountability of…

cs.CR, cs.AI · Recent · May 1, 2026

E-MIA: Exam-Style Black-Box Membership Inference Attacks against RAG Systems

Zelin Guan, Shengda Zhuo, Zeyan Li, Jinchun He +3 more

E-MIA introduces a novel, stealthy black-box membership inference attack that converts verifiable hard evidence within a candidate document into an objective, multi-part exam score to determine if the…

cs.CL, cs.AI, cs.LG · Recent · May 28, 2026

Generalistic or Specific Embeddings, Which is Better? An Empirical Study on Search for Clinical Coding in Non-English Languages

David Rey-Blanco, Roberto Cruz

The authors demonstrate that fine-tuning a two-stage retrieval system using synthetic data generated by large language models can significantly improve the performance of medical semantic search for c…

cs.CL · Recent · May 28, 2026

Auditing LLM Benchmarks with Item Response Theory

Sander Land, Daniel M. Bikel

The paper introduces an Item Response Theory (IRT)-based indicator that effectively identifies likely mislabeled items in existing LLM benchmarks, revealing systematic errors in labeling and model spe…

cs.CL, cs.CR · Recent · May 9, 2026

BiAxisAudit: A Novel Framework to Evaluate LLM Bias Across Prompt Sensitivity and Response-Layer Divergence

Jialing Gan, Junhao Dong, Songze Li

The paper introduces BiAxisAudit, a novel framework that evaluates LLM bias by analyzing bias scores across multiple prompt formats and within the internal inconsistency of model responses, revealing…
