Papers similar to 2606.01094

20 results

cs.CL, cs.AI · Recent · May 28, 2026

SURGENT: A Surgical Multi-Agent Assistance System Across the Perioperative Workflow

Dongsheng Shi, Yue Li, Xin Yi, Yongyi Cui +2 more

The paper introduces SURGENT, a multi-agent assistance system designed for the entire perioperative workflow, which outperforms standard LLMs by providing context-aware, traceable, and privacy-preserv…

cs.AI · Recent · Jun 1, 2026

AutoMedBench: Towards Medical AutoResearch with Agentic AI Models

Junqi Liu, Salena Song, Yuhan Wang, Jiawei Mao +11 more

The paper introduces AutoMedBench, a novel workflow-aware benchmark that evaluates autonomous medical-AI agents across a five-stage research process, revealing that agents struggle most with validatio…

cs.AI, cs.CL, cs.ET · Recent · Jun 1, 2026

ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents

Yuxing Lu, Yushuhong Lin, Wenqi Shi, J. Ben Tamo +3 more

The paper introduces ClinEnv, a novel interactive, multi-stage benchmark designed to evaluate LLMs' decision-making and information-gathering process during longitudinal inpatient medical simulations.

cs.AI · Recent · May 27, 2026

Trends in AI and Human-AI Interaction in Clinical Trials -- A Hybrid Human-AI Exploration

Sandra Woolley, Tim Collins, Khalid Khattak, Illia Chernomorets +2 more

This study analyzes ClinicalTrials.gov records to track the rising trend of AI in clinical trials and demonstrates that a hybrid human-AI screening approach is viable but requires clearer reporting of…

cs.AI · Recent · May 28, 2026

EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

Yuzhang Xie, Keqi Han, Yunpeng Xiao, Hejie Cui +6 more

The paper introduces EHRBench, a large-scale, automated, and reliable benchmark derived from real Electronic Health Records (EHRs) to rigorously evaluate the clinical decision-making capabilities of L…

cs.AI · Recent · May 27, 2026

Do Clinical Models Change Treatment Decisions?

Dongkyu Cho, Miao Zhang, Rumi Chunara

The paper introduces ClinPivot, a benchmark that tests whether clinical models can correctly adjust treatment decisions when new patient context constraints are introduced, finding that strong medical…

cs.CL, cs.AI · Recent · May 27, 2026

SafeRx-Agent: A Knowledge-Grounded Multi-Agent Framework for Safe and Explainable Medication Recommendation

Xinyu Wang, Hanwei Wu, Zhenghan Tai, Sicheng Lyu +6 more

The paper introduces SafeRx-Agent, a knowledge-grounded multi-agent framework that improves medication recommendation accuracy and safety by incorporating fine-grained ATC codes and rigorous safety ve…

cs.AI · Recent · May 27, 2026

C-MIG: Multi-view Information Gain-based Retrieval-Augmented Generation for Clinical Diagnosis Reasoning

Yuwei Miao, Gen Li, Yunsheng Zeng, Xiandong Li +7 more

C-MIG is a novel retrieval-augmented generation framework that uses multi-view information gain to improve clinical diagnosis reasoning by providing richer, more nuanced reward signals than existing m…

cs.AI · Recent · May 27, 2026

SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models

Chao Ding, Mouxiao Bian, Tianbin Li, Minjia Yuan +11 more

The paper introduces SafeMed-R1, a clinically audited LLM that significantly improves safety and ethical alignment for medical applications, matching or exceeding resident performance on safety-critic…

cs.AI, cs.CL, cs.LG · Recent · May 28, 2026

Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence

Yanan Wang, Shuaicong Hu, Jian Liu, Guohui Zhou +2 more

The paper proposes HetMedAgent, a multi-agent framework, demonstrating that combining generalist LLMs with domain-specific specialist models significantly improves medical AI performance by enabling s…

cs.LG, cs.AI, cs.CL · Recent · May 28, 2026

Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

Matt Turk

The paper introduces the Causal Sensitivity Score (CSS), an interventional metric that reveals that standard coverage-based evaluations fail to detect critical responsiveness deficits in clinical LLMs…

cs.AI · Recent · May 30, 2026

ForeSci: Evaluating LLM Agents for Forward-Looking AI Research Judgment

Qiuyu Tian, Zequn Liu, Yingce Xia, Haojie Yin +1 more

The paper introduces ForeSci, a novel benchmark that evaluates LLM agents' ability to make forward-looking research judgments using only historical evidence, finding that explicit evidence organizatio…

cs.AI · Recent · Jun 1, 2026

BADGER: Bridging Agentic and Deterministic Evaluation for Generative Enterprise Reasoning

Shannon Serrao, Soumitra Chatterjee, Dorina Strori, Abhishek Sharma +1 more

BADGER is a unified, production-grade evaluation framework that integrates text-to-SQL assessment with agentic behavior evaluation, significantly outperforming existing benchmarks on industry queries.

cs.AI · Recent · Jun 1, 2026

Beyond One-shot: AI Agents for Learning in Field Experiments

Junjie Luo, Ritu Agarwal, Gordon Gao

The paper demonstrates that tool-augmented agentic AI can learn from prior field experiment data to automatically generate superior, domain-specific interventions, transforming one-shot A/B testing in…

cs.CR, cs.AI, cs.NI · Recent · Mar 26, 2026

Sovereign AI at the Front Door of Care: A Physically Unidirectional Architecture for Secure Clinical Intelligence

Vasu Srinivasan, Dhriti Vasu

The paper proposes a Sovereign AI architecture for clinical triage that ensures maximum security by performing all inference on-device and receiving data only through physically unidirectional channel…

cs.AI · Recent · May 28, 2026

Think Fast, Talk Smart: Partitioning Deterministic and Neural Computation for Structured Health Text Generation

Kai-Chen Cheng, Haejun Han, David Q. Sun

The paper proposes 'Think Fast, Talk Smart,' a pipeline that separates deterministic data analysis from LLM generation, showing that offloading recurring, structured tasks to code significantly improv…

cs.CL, cs.AI · Recent · May 31, 2026

TimeSage-MT: A Multi-Turn Benchmark for Evaluating Agentic Time Series Reasoning

Yaxuan Kong, Qingren Yao, Yuqi Nie, Yichen Li +6 more

The paper introduces TimeSage-MT, a comprehensive multi-turn benchmark designed to rigorously test an LLM agent's ability to perform complex, evolving time series analysis, revealing critical gaps in…

cs.CL · Recent · May 31, 2026

DrugClaw and DrugAudit: A Primary-Source-Grounded Agent and Authority-Aware Benchmark for Drug-Information Question Answering

Qing Wang, Bo Li, Jialu Liang, Daling Shi +2 more

The paper introduces DrugClaw, a multi-agent system, and DrugAudit, a new benchmark, demonstrating that DrugClaw excels at answering drug-related questions by grounding answers in primary regulatory s…

cs.AI · Recent · May 28, 2026

VitalAgent: A Tool-Augmented Agent for Reactive and Proactive Physiological Monitoring over Wearable Health Data

Di Zhu, Yu Yvonne Wu, Hong Jia, Aaqib Saeed +2 more

VitalAgent is a novel tool-augmented agentic framework that significantly improves physiological monitoring from wearable health data by enabling both reactive question answering and proactive, long-t…

cs.AI, cs.CR · Recent · Apr 18, 2026

If Only My CGM Could Speak: A Privacy-Preserving Agent for Question Answering over Continuous Glucose Data

Yanjun Cui, Ali Emami, Temiloluwa Prioleau, Nikhil Singh

The paper introduces CGM-Agent, a privacy-preserving framework that allows users to ask free-form questions about their continuous glucose data using LLMs while ensuring all computation remains local…
