Similar to 2605.31512 · 20 results
This paper introduces KliniskVestBERT, a suite of BERT models specialized by pre-training on a large, diverse corpus of real-world Norwegian clinical texts, demonstrating superior performance for clin…
The study demonstrates that LLMs exhibit significant, language-driven disparities in medical triage recommendations, recommending emergency care more frequently for English and Arabic prompts, even wh…
Yuzhang Xie, Keqi Han, Yunpeng Xiao, Hejie Cui +6 more
The paper introduces EHRBench, a large-scale, automated, and reliable benchmark derived from real Electronic Health Records (EHRs) to rigorously evaluate the clinical decision-making capabilities of L…
The paper introduces Script-Normalized WER (SN-WER), a novel evaluation metric that transliterates ASR transcripts into a canonical script to accurately measure speech recognition performance across d…
Baris Karacan, Vaibhav Bhargava, Barbara Di Eugenio, Natalie Parde +20 more
The paper introduces a supervised fine-tuning pipeline using large language models to accurately categorize sentence-level clinical provenance across multi-disciplinary hospital notes, demonstrating t…
The paper evaluates the semantic stability of clinical LLMs to linguistic variations, finding that domain specialization does not guarantee consistent robustness improvements.
The paper proposes 'Think Fast, Talk Smart,' a pipeline that separates deterministic data analysis from LLM generation, showing that offloading recurring, structured tasks to code significantly improv…
The paper introduces MedCase-Structured, a synthetic, FHIR-formatted dataset designed to benchmark diagnostic reasoning in realistic EHR settings, showing that LLMs perform worse on structured data th…
Mengyu Xu, Qiaoxin Yang, Qianqian Wang, Xiwei Dai +2 more
The paper introduces MIRA, a bilingual benchmark that reveals that LLMs tend to dilute or omit critical medical information when responding to prompts from users with low health literacy, a pattern te…
The paper addresses 'Template Collapse' in 3D CT report generation (where models generate generic reports) by proposing CLarGen, a decoupled framework that significantly improves clinical accuracy and d…
This paper evaluates multiple LLMs (DeepSeek-R1, OpenBioLLM-Llama3, Qwen 3.5) for generating privacy-safe, high-quality synthetic mental health reports, demonstrating their effectiveness in expanding…
The paper investigates apparent LLM triage failures and concludes that the errors originate in the output format and decision process, rather than a deficiency in the model's underlying clinical knowl…
The authors demonstrate that fine-tuning a two-stage retrieval system using synthetic data generated by large language models can significantly improve the performance of medical semantic search for c…
The paper introduces a multilingual corpus and demonstrates that small, fine-tuned language models (SLMs) are highly effective for Citation Needed Detection (CND) in lower-resource languages, often ou…
Tengfei Zhang, Ziheng Zhao, Lisong Dai, Xiaoman Zhang +4 more
This paper introduces MedReCo and MedReCo-VLM, a framework that enables entity-aware cross-image reasoning for medical imaging, allowing AI to compare current scans with prior studies and analogous ca…
Chao Ding, Mouxiao Bian, Tianbin Li, Minjia Yuan +11 more
The paper introduces SafeMed-R1, a clinically audited LLM that significantly improves safety and ethical alignment for medical applications, matching or exceeding resident performance on safety-critic…
HypothesisMed introduces an inference-time pipeline for biomedical question answering that improves model reliability and structured output generation by fusing multiple model outputs and diagnosing t…
This paper investigates why self-harm prediction models struggle to generalize across different hospitals, finding that variations in local lexical expression and feature importance are the primary ca…
This study systematically analyzes strategies for creating reliable multilingual LLM-as-a-judge evaluators, finding that fine-tuning smaller models with in-domain data is effective, while zero-shot evaluation w…
The paper introduces Reverse Probing, a novel framework that quantifies token-level uncertainty in large language models (LLMs) specifically for clinical text by analyzing internal model activations,…