Similar to 2606.01094 · 20 results
Dongsheng Shi, Yue Li, Xin Yi, Yongyi Cui +2 more
The paper introduces SURGENT, a multi-agent assistance system designed for the entire perioperative workflow, which outperforms standard LLMs by providing context-aware, traceable, and privacy-preserv…
Junqi Liu, Salena Song, Yuhan Wang, Jiawei Mao +11 more
The paper introduces AutoMedBench, a novel workflow-aware benchmark that evaluates autonomous medical-AI agents across a five-stage research process, revealing that agents struggle most with validatio…
Yuxing Lu, Yushuhong Lin, Wenqi Shi, J. Ben Tamo +3 more
The paper introduces ClinEnv, a novel interactive, multi-stage benchmark designed to evaluate LLMs' decision-making and information-gathering process during longitudinal inpatient medical simulations.
This study analyzes ClinicalTrials.gov records to track the rising trend of AI in clinical trials and demonstrates that a hybrid human-AI screening approach is viable but requires clearer reporting of…
Yuzhang Xie, Keqi Han, Yunpeng Xiao, Hejie Cui +6 more
The paper introduces EHRBench, a large-scale, automated, and reliable benchmark derived from real Electronic Health Records (EHRs) to rigorously evaluate the clinical decision-making capabilities of L…
The paper introduces ClinPivot, a benchmark that tests whether clinical models can correctly adjust treatment decisions when new patient context constraints are introduced, finding that strong medical…
Xinyu Wang, Hanwei Wu, Zhenghan Tai, Sicheng Lyu +6 more
The paper introduces SafeRx-Agent, a knowledge-grounded multi-agent framework that improves medication recommendation accuracy and safety by incorporating fine-grained ATC codes and rigorous safety ve…
Yuwei Miao, Gen Li, Yunsheng Zeng, Xiandong Li +7 more
C-MIG is a novel retrieval-augmented generation framework that uses multi-view information gain to improve clinical diagnosis reasoning by providing richer, more nuanced reward signals than existing m…
Chao Ding, Mouxiao Bian, Tianbin Li, Minjia Yuan +11 more
The paper introduces SafeMed-R1, a clinically audited LLM that significantly improves safety and ethical alignment for medical applications, matching or exceeding resident performance on safety-critic…
Yanan Wang, Shuaicong Hu, Jian Liu, Guohui Zhou +2 more
The paper proposes HetMedAgent, a multi-agent framework, demonstrating that combining generalist LLMs with domain-specific specialist models significantly improves medical AI performance by enabling s…
The paper introduces the Causal Sensitivity Score (CSS), an interventional metric that reveals that standard coverage-based evaluations fail to detect critical responsiveness deficits in clinical LLMs…
Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more
The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…
BADGER is a unified, production-grade evaluation framework that integrates text-to-SQL assessment with agentic behavior evaluation, significantly outperforming existing benchmarks on industry queries.
The paper demonstrates that tool-augmented agentic AI can learn from prior field experiment data to automatically generate superior, domain-specific interventions, transforming one-shot A/B testing in…
The paper proposes a Sovereign AI architecture for clinical triage that ensures maximum security by performing all inference on-device and receiving data only through physically unidirectional channel…
The paper proposes 'Think Fast, Talk Smart,' a pipeline that separates deterministic data analysis from LLM generation, showing that offloading recurring, structured tasks to code significantly improv…
Yaxuan Kong, Qingren Yao, Yuqi Nie, Yichen Li +6 more
The paper introduces TimeSage-MT, a comprehensive multi-turn benchmark designed to rigorously test an LLM agent's ability to perform complex, evolving time series analysis, revealing critical gaps in…
Qing Wang, Bo Li, Jialu Liang, Daling Shi +2 more
The paper introduces DrugClaw, a multi-agent system, and DrugAudit, a new benchmark, demonstrating that DrugClaw excels at answering drug-related questions by grounding answers in primary regulatory s…
Di Zhu, Yu Yvonne Wu, Hong Jia, Aaqib Saeed +2 more
VitalAgent is a novel tool-augmented agentic framework that significantly improves physiological monitoring from wearable health data by enabling both reactive question answering and proactive, long-t…
The paper introduces CGM-Agent, a privacy-preserving framework that allows users to ask free-form questions about their continuous glucose data using LLMs while ensuring all computation remains local…