~ similar to 2606.01961· 20 results
AutoForest is an end-to-end system that automatically generates publication-ready forest plots directly from biomedical papers, streamlining the labor-intensive process of meta-analysis.
The paper introduces an agentic, framework-based system to transform under-specified academic papers into standardized, comparable, and executable benchmarks for industrial Prognostics and Health Mana…
Qing Wang, Bo Li, Jialu Liang, Daling Shi +2 more
The paper introduces DrugClaw, a multi-agent system, and DrugAudit, a new benchmark, demonstrating that DrugClaw excels at answering drug-related questions by grounding answers in primary regulatory s…
Yunqi Liu, Tong Niu, Zitong Wang, Zhenlong Dai +3 more
The paper introduces EgoBench, the first interactive multimodal benchmark designed to jointly evaluate advanced AI agents' capabilities in visual perception, multi-hop reasoning, and dynamic tool usag…
This study analyzes ClinicalTrials.gov records to track the rising trend of AI in clinical trials and demonstrates that a hybrid human-AI screening approach is viable but requires clearer reporting of…
The paper introduces SpatialBench-Long, a comprehensive benchmark designed to test AI agents' ability to perform end-to-end scientific reasoning and derive biological claims from complex, raw spatial…
Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more
The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…
Yeqi Huang, Yue Chen, Yanwei Ye, Guanhao Su +1 more
The paper introduces Ryze, an automated system that synthesizes evidence-enriched Question-Answering (QA) pairs from raw biomedical papers, resulting in a specialized VLM (BioVLM-8B) that significantl…
The paper introduces the Causal Sensitivity Score (CSS), an interventional metric that reveals that standard coverage-based evaluations fail to detect critical responsiveness deficits in clinical LLMs…
Mengyu Xu, Qiaoxin Yang, Qianqian Wang, Xiwei Dai +2 more
The paper introduces MIRA, a bilingual benchmark that reveals that LLMs tend to dilute or omit critical medical information when responding to prompts from users with low health literacy, a pattern te…
Yanan Wang, Shuaicong Hu, Jian Liu, Guohui Zhou +2 more
The paper proposes HetMedAgent, a multi-agent framework, demonstrating that combining generalist LLMs with domain-specific specialist models significantly improves medical AI performance by enabling s…
Chao Ding, Mouxiao Bian, Tianbin Li, Minjia Yuan +11 more
The paper introduces SafeMed-R1, a clinically audited LLM that significantly improves safety and ethical alignment for medical applications, matching or exceeding resident performance on safety-critic…
AutoScientists introduces a decentralized, self-organizing team of AI agents that significantly improves long-running scientific experimentation by enabling parallel exploration and knowledge sharing.
HypothesisMed introduces an inference-time pipeline for biomedical question answering that improves model reliability and structured output generation by fusing multiple model outputs and diagnosing t…
Tengfei Zhang, Ziheng Zhao, Lisong Dai, Xiaoman Zhang +4 more
This paper introduces MedReCo and MedReCo-VLM, a framework that enables entity-aware cross-image reasoning for medical imaging, allowing AI to compare current scans with prior studies and analogous ca…
Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more
The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…
Yuxing Lu, Yushuhong Lin, Wenqi Shi, J. Ben Tamo +3 more
The paper introduces ClinEnv, a novel interactive, multi-stage benchmark designed to evaluate LLMs' decision-making and information-gathering process during longitudinal inpatient medical simulations.
The paper introduces AgenticVBench, a comprehensive benchmark of 100 real-world video post-production tasks, and finds that even the best AI agents perform significantly worse than human experts on th…
Daniel Begimher, Cristian Leo, Jack Huang, Pat Gaw +1 more
The paper introduces SIR-Bench, a comprehensive benchmark of 794 test cases, to rigorously evaluate autonomous security incident response agents by measuring their ability to perform deep forensic inv…
Yuzhang Xie, Keqi Han, Yunpeng Xiao, Hejie Cui +6 more
The paper introduces EHRBench, a large-scale, automated, and reliable benchmark derived from real Electronic Health Records (EHRs) to rigorously evaluate the clinical decision-making capabilities of L…