Hardy Chen
2 indexed papers
Research Timeline
This paper conducts the first real-world safety evaluation of the personal AI agent OpenClaw, demonstrating that its broad system access creates inherent vulnerabilities that significantly increase the attack success rate regardless of the underlying large language model.
The paper introduces AutoMedBench, a novel workflow-aware benchmark that evaluates autonomous medical-AI agents across a five-stage research process, revealing that agents struggle most with validation and submission.
Papers
AutoMedBench: Towards Medical AutoResearch with Agentic AI Models
Junqi Liu, Salena Song, Yuhan Wang, Jiawei Mao +11 more