~ similar to 2605.29744· 20 results
Dongsheng Shi, Yue Li, Xin Yi, Yongyi Cui +2 more
The paper introduces SURGENT, a multi-agent assistance system designed for the entire perioperative workflow, which outperforms standard LLMs by providing context-aware, traceable, and privacy-preserv…
Junqi Liu, Salena Song, Yuhan Wang, Jiawei Mao +11 more
The paper introduces AutoMedBench, a novel workflow-aware benchmark that evaluates autonomous medical-AI agents across a five-stage research process, revealing that agents struggle most with validatio…
The paper proposes ARSM-Agent, a full-link security enhancement framework, to significantly improve the adversarial robustness and security of large language model agents used for critical medical dec…
This study analyzes ClinicalTrials.gov records to track the rising trend of AI in clinical trials and demonstrates that a hybrid human-AI screening approach is viable but requires clearer reporting of…
Ruihui Hou, Ziyue Huai, Chennuo Zhang, Ziyan Liu +4 more
CAREAgent is a novel agent designed for fine-grained clinical order generation, achieving significant performance improvements on unseen benchmarks by integrating structured reasoning and tool usage.
Yuxing Lu, Yushuhong Lin, Wenqi Shi, J. Ben Tamo +3 more
The paper introduces ClinEnv, a novel interactive, multi-stage benchmark designed to evaluate LLMs' decision-making and information-gathering process during longitudinal inpatient medical simulations.
The paper evaluates the semantic stability of clinical LLMs to linguistic variations, finding that domain specialization does not guarantee consistent robustness improvements.
Xinyu Wang, Hanwei Wu, Zhenghan Tai, Sicheng Lyu +6 more
The paper introduces SafeRx-Agent, a knowledge-grounded multi-agent framework that improves medication recommendation accuracy and safety by incorporating fine-grained ATC codes and rigorous safety ve…
Chao Ding, Mouxiao Bian, Tianbin Li, Minjia Yuan +11 more
The paper introduces SafeMed-R1, a clinically audited LLM that significantly improves safety and ethical alignment for medical applications, matching or exceeding resident performance on safety-critic…
Ruiyi Zhang, Peijia Qin, Qi Cao, Li Zhang +1 more
The paper introduces AIBuildAI-2, a knowledge-enhanced agent that significantly improves the automatic building of AI models by integrating an external, evolving knowledge system, achieving state-of-t…
The paper introduces the Causal Sensitivity Score (CSS), an interventional metric that reveals that standard coverage-based evaluations fail to detect critical responsiveness deficits in clinical LLMs…
The paper proposes 'Think Fast, Talk Smart,' a pipeline that separates deterministic data analysis from LLM generation, showing that offloading recurring, structured tasks to code significantly improv…
Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more
The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…
Przemyslaw Biecek, Luca Longo, Jianlong Zhou, Thomas Fel +2 more
The paper advocates for the establishment of Model Science, a systematic discipline that moves beyond simple benchmarking to deeply analyze AI models' internal workings and failure modes.
Yuzhang Xie, Keqi Han, Yunpeng Xiao, Hejie Cui +6 more
The paper introduces EHRBench, a large-scale, automated, and reliable benchmark derived from real Electronic Health Records (EHRs) to rigorously evaluate the clinical decision-making capabilities of L…
Drishti Goel, Agam Goyal, Veda Duddu, Olivia Pal +7 more
This study demonstrates that an LLM's assigned support role (e.g., Inform, Coach, Relate) significantly alters its safety profile and the types of risks it presents when assisting users in complex car…
The paper introduces Med-HEAL, a comprehensive framework and dataset for systematically identifying and mitigating hallucinations in medical LLMs, demonstrating that a self-critique pipeline significa…
The paper models healthcare mechanism design as program synthesis, demonstrating that an optimized, mixed-objective program can eliminate up-coding and reduce patient rejection while maintaining finan…
The paper investigates apparent LLM triage failures and concludes that the errors originate in the output format and decision process, rather than a deficiency in the model's underlying clinical knowl…
This paper evaluates multiple LLMs (DeepSeek-R1, OpenBioLLM-Llama3, Qwen 3.5) for generating privacy-safe, high-quality synthetic mental health reports, demonstrating their effectiveness in expanding…