Papers similar to 2606.01961

~ similar to 2606.01961· 20 results

cs.CLcs.AIRecentJun 1, 2026

AutoForest: Automatically Generating Forest Plots from Biomedical Studies with End-to-End Evidence Extraction and Synthesis

Massimiliano Pronesti, Angelo Miculescu, Mohsin Kapdi, Paul Flanagan +7 more

AutoForest is an end-to-end system that automatically generates publication-ready forest plots directly from biomedical papers, streamlining the labor-intensive process of meta-analysis.

View →

cs.AIcs.LGcs.SERecentMay 27, 2026

From paper to benchmark: agentic, framework-based reproduction of under-specified methods in machine health intelligence

Raffael Theiler, Ludovico Comito, David Leko, Leandro Von Krannichfeldt +2 more

The paper introduces an agentic, framework-based system to transform under-specified academic papers into standardized, comparable, and executable benchmarks for industrial Prognostics and Health Mana…

View →

cs.CLRecentMay 31, 2026

DrugClaw and DrugAudit: A Primary-Source-Grounded Agent and Authority-Aware Benchmark for Drug-Information Question Answering

Qing Wang, Bo Li, Jialu Liang, Daling Shi +2 more

The paper introduces DrugClaw, a multi-agent system, and DrugAudit, a new benchmark, demonstrating that DrugClaw excels at answering drug-related questions by grounding answers in primary regulatory s…

View →

cs.AIRecentMay 27, 2026

EgoBench: An Interactive Egocentric Multimodal Benchmark for Tool-Using Agents

Yunqi Liu, Tong Niu, Zitong Wang, Zhenlong Dai +3 more

The paper introduces EgoBench, the first interactive multimodal benchmark designed to jointly evaluate advanced AI agents' capabilities in visual perception, multi-hop reasoning, and dynamic tool usag…

View →

cs.AIRecentMay 27, 2026

Trends in AI and Human-AI Interaction in Clinical Trials -- A Hybrid Human-AI Exploration

Sandra Woolley, Tim Collins, Khalid Khattak, Illia Chernomorets +2 more

This study analyzes ClinicalTrials.gov records to track the rising trend of AI in clinical trials and demonstrates that a hybrid human-AI screening approach is viable but requires clearer reporting of…

View →

cs.AIRecentMay 27, 2026

Verifiable Benchmarking of Long-Horizon Spatial Biology

Ian Diks, Harihara Muralidharan, Tim Proctor, Kenny Workman

The paper introduces SpatialBench-Long, a comprehensive benchmark designed to test AI agents' ability to perform end-to-end scientific reasoning and derive biological claims from complex, raw spatial…

View →

cs.AIRecentMay 27, 2026

Benchmarking AI for low-resource contexts: Thinking beyond leaderboards

Aakash Pant, Kavya Shah, Apoorv Agnihotri, Sneha Nikam +2 more

The paper critiques current AI benchmarking practices for low-resource settings, arguing that evaluation must shift focus from isolated model performance to the holistic performance of the deployed sy…

View →

cs.AIRecentMay 30, 2026

Ryze: Evidence-Enriched Data Synthesis from Biomedical Papers

Yeqi Huang, Yue Chen, Yanwei Ye, Guanhao Su +1 more

The paper introduces Ryze, an automated system that synthesizes evidence-enriched Question-Answering (QA) pairs from raw biomedical papers, resulting in a specialized VLM (BioVLM-8B) that significantl…

View →

cs.LGcs.AIcs.CLRecentMay 28, 2026

Counterfactual Evaluation Reveals Hidden Capability Profiles in Clinical LLMs and Agents

Matt Turk

The paper introduces the Causal Sensitivity Score (CSS), an interventional metric that reveals that standard coverage-based evaluations fail to detect critical responsiveness deficits in clinical LLMs…

View →

cs.AIcs.CLcs.CYRecentMay 27, 2026

MIRA: A Bilingual Benchmark for Medical Information Response Audit

Mengyu Xu, Qiaoxin Yang, Qianqian Wang, Xiwei Dai +2 more

The paper introduces MIRA, a bilingual benchmark that reveals that LLMs tend to dilute or omit critical medical information when responding to prompts from users with low health literacy, a pattern te…

View →

cs.AIcs.CLcs.LGRecentMay 28, 2026

Why Specialist Models Still Matter: A Heterogeneous Multi-Agent Paradigm for Medical Artificial Intelligence

Yanan Wang, Shuaicong Hu, Jian Liu, Guohui Zhou +2 more

The paper proposes HetMedAgent, a multi-agent framework, demonstrating that combining generalist LLMs with domain-specific specialist models significantly improves medical AI performance by enabling s…

View →

cs.AIRecentMay 27, 2026

SafeMed-R1: Clinician-Audited Safety and Ethics Alignment for Medical Large Language Models

Chao Ding, Mouxiao Bian, Tianbin Li, Minjia Yuan +11 more

The paper introduces SafeMed-R1, a clinically audited LLM that significantly improves safety and ethical alignment for medical applications, matching or exceeding resident performance on safety-critic…

View →

cs.AIRecentMay 27, 2026

AutoScientists: Self-Organizing Agent Teams for Long-Running Scientific Experimentation

Shanghua Gao, Ada Fang, Marinka Zitnik

AutoScientists introduces a decentralized, self-organizing team of AI agents that significantly improves long-running scientific experimentation by enabling parallel exploration and knowledge sharing.

View →

cs.CLRecentMay 31, 2026

HypothesisMed: Inference-Time Answer Fusion and Structured Hypothesis-Space Reporting for Biomedical Question Answering

Md Motaleb Hossen Manik, Ge Wang

HypothesisMed introduces an inference-time pipeline for biomedical question answering that improves model reliability and structured output generation by fusing multiple model outputs and diagnosing t…

View →

cs.CVcs.IRcs.LGRecentJun 4, 2026

A Vision-language Framework for Comparative Reasoning in Radiology

Tengfei Zhang, Ziheng Zhao, Lisong Dai, Xiaoman Zhang +4 more

This paper introduces MedReCo and MedReCo-VLM, a framework that enables entity-aware cross-image reasoning for medical imaging, allowing AI to compare current scans with prior studies and analogous ca…

View →

cs.SEcs.AIcs.CLRecentMay 29, 2026

BlueFin: Benchmarking LLM Agents on Financial Spreadsheets

Srivatsa Kundurthy, Clara Na, Colton Moraine, Anoushka Mohta +5 more

The paper introduces BlueFin, a challenging benchmark for evaluating LLM agents on complex financial spreadsheet tasks, finding that even frontier models perform poorly, scoring less than 50% on avera…

View →

cs.AIcs.CLcs.ETRecentJun 1, 2026

ClinEnv: An Interactive Multi-Stage Long Horizon EHR Environment for Agents

Yuxing Lu, Yushuhong Lin, Wenqi Shi, J. Ben Tamo +3 more

The paper introduces ClinEnv, a novel interactive, multi-stage benchmark designed to evaluate LLMs' decision-making and information-gathering process during longitudinal inpatient medical simulations.

View →

cs.CRcs.MMRecentMay 26, 2026

AgenticVBench: Can AI Agents Complete Real-World Post-Production Tasks?

Zongheng Cao, Yi Zheng, Rui Song, Xinyu Hu

The paper introduces AgenticVBench, a comprehensive benchmark of 100 real-world video post-production tasks, and finds that even the best AI agents perform significantly worse than human experts on th…

View →

cs.CRcs.AIcs.SERecentApr 13, 2026

SIR-Bench: Evaluating Investigation Depth in Security Incident Response Agents

Daniel Begimher, Cristian Leo, Jack Huang, Pat Gaw +1 more

The paper introduces SIR-Bench, a comprehensive benchmark of 794 test cases, to rigorously evaluate autonomous security incident response agents by measuring their ability to perform deep forensic inv…

View →

cs.AIRecentMay 28, 2026

EHRBench: An Automated and Reliable EHR-based Benchmark for Clinical Decision Making with LLMs

Yuzhang Xie, Keqi Han, Yunpeng Xiao, Hejie Cui +6 more

The paper introduces EHRBench, a large-scale, automated, and reliable benchmark derived from real Electronic Health Records (EHRs) to rigorously evaluate the clinical decision-making capabilities of L…

View →