~ similar to 2605.30295· 20 results
Yuxing Lu, Yushuhong Lin, Wenqi Shi, J. Ben Tamo +3 more
The paper introduces ClinEnv, a novel interactive, multi-stage benchmark designed to evaluate LLMs' decision-making and information-gathering process during longitudinal inpatient medical simulations.
Yuzhang Xie, Keqi Han, Yunpeng Xiao, Hejie Cui +6 more
The paper introduces EHRBench, a large-scale, automated, and reliable benchmark derived from real Electronic Health Records (EHRs) to rigorously evaluate the clinical decision-making capabilities of L…
The paper proposes 'Think Fast, Talk Smart,' a pipeline that separates deterministic data analysis from LLM generation, showing that offloading recurring, structured tasks to code significantly improv…
Yuwei Miao, Gen Li, Yunsheng Zeng, Xiandong Li +7 more
C-MIG is a novel retrieval-augmented generation framework that uses multi-view information gain to improve clinical diagnosis reasoning by providing richer, more nuanced reward signals than existing m…
Mengyu Xu, Qiaoxin Yang, Qianqian Wang, Xiwei Dai +2 more
The paper introduces MIRA, a bilingual benchmark that reveals that LLMs tend to dilute or omit critical medical information when responding to prompts from users with low health literacy, a pattern te…
The paper introduces Factual Density (FD*), a novel retrieval signal that measures the proportion of verified facts, demonstrating that optimizing RAG retrieval based on this density significantly imp…
The paper evaluates the semantic stability of clinical LLMs to linguistic variations, finding that domain specialization does not guarantee consistent robustness improvements.
Baris Karacan, Vaibhav Bhargava, Barbara Di Eugenio, Natalie Parde +20 more
The paper introduces a supervised fine-tuning pipeline using large language models to accurately categorize sentence-level clinical provenance across multi-disciplinary hospital notes, demonstrating t…
The paper introduces ClinPivot, a benchmark that tests whether clinical models can correctly adjust treatment decisions when new patient context constraints are introduced, finding that strong medical…
HypothesisMed introduces an inference-time pipeline for biomedical question answering that improves model reliability and structured output generation by fusing multiple model outputs and diagnosing t…
This paper introduces KliniskVestBERT, a suite of BERT models specialized by pre-training on a large, diverse corpus of real-world Norwegian clinical texts, demonstrating superior performance for clin…
The paper investigates apparent LLM triage failures and concludes that the errors originate in the output format and decision process, rather than a deficiency in the model's underlying clinical knowl…
The paper addresses 'Template Collapse' in 3D CT report generation—where models generate generic reports—by proposing CLarGen, a decoupled framework that significantly improves clinical accuracy and d…
Chao Ding, Mouxiao Bian, Tianbin Li, Minjia Yuan +11 more
The paper introduces SafeMed-R1, a clinically audited LLM that significantly improves safety and ethical alignment for medical applications, matching or exceeding resident performance on safety-critic…
Junqi Liu, Salena Song, Yuhan Wang, Jiawei Mao +11 more
The paper introduces AutoMedBench, a novel workflow-aware benchmark that evaluates autonomous medical-AI agents across a five-stage research process, revealing that agents struggle most with validatio…
Yeqi Huang, Yue Chen, Yanwei Ye, Guanhao Su +1 more
The paper introduces Ryze, an automated system that synthesizes evidence-enriched Question-Answering (QA) pairs from raw biomedical papers, resulting in a specialized VLM (BioVLM-8B) that significantl…
This paper evaluates multiple LLMs (DeepSeek-R1, OpenBioLLM-Llama3, Qwen 3.5) for generating privacy-safe, high-quality synthetic mental health reports, demonstrating their effectiveness in expanding…
The paper introduces RefWalk, a novel framework designed to improve regulatory compliance question answering by ensuring rigorous citation traceability and explicit per-rule attribution across complex…
The paper introduces a simple, token-efficient vision-language model for generating comprehensive pathology synoptic reports from multiple whole-slide images (WSIs), achieving high performance while s…
AutoForest is an end-to-end system that automatically generates publication-ready forest plots directly from biomedical papers, streamlining the labor-intensive process of meta-analysis.