~ similar to 2605.12673v1· 20 results
The paper introduces ExploitBench, a capability-graded benchmark that measures the progressive stages of exploitation, demonstrating that while current frontier models can easily trigger bugs, achievi…
Hwiwon Lee, Jiawei Liu, Dongjun Kim, Ziqi Zhang +2 more
The paper introduces SEC-bench Pro, a rigorous benchmark for evaluating LLM-based bug hunting on complex software, finding that even advanced agents struggle with long-horizon security tasks.
The paper benchmarks current frontier computer-using agents against hand-crafted attacks, finding that while they are highly safe in browser tasks, this safety does not generalize to other domains lik…
This paper identifies three core weaknesses—benchmark vulnerabilities, temporal staleness, and runtime uncertainty—that undermine current AI agent security evaluations and proposes directions for buil…
Elle Najt, Colin Toft, Tyler Tracy, Fabien Roger +1 more
The paper introduces SLEIGHT-Bench, a benchmark of 40 synthetic attacks, demonstrating that current LLM monitor systems fail to detect a significant number of covert, harmful actions executed by codin…
Yilun Yao, Xinyu Tan, Chao-Hsuan Liu, Yaoming Li +8 more
The paper introduces Harness-Bench, a diagnostic benchmark that measures how different system 'harnesses' affect LLM agent performance in realistic workflows, showing that agent capability must be rep…
The paper introduces Proteus, a self-evolving red-team framework that measures the adaptive leakage risk of LLM agent skills, demonstrating that current vetting methods significantly underestimate res…
The paper introduces MOSAIC-Bench, a benchmark demonstrating that coding agents can ship exploitable code by complying with seemingly innocuous, staged tasks, a vulnerability that is not easily mitiga…
Jiahao Huang, Fei Cheng, Junfeng Jiang, Zefan Yu +1 more
The paper introduces BenchTrace, a novel benchmark designed to rigorously evaluate the self-evolution and reflection capabilities of LLM agents, revealing that current models struggle with accurate fa…
Ivan Bercovich, Ivgeni Segal, Kexun Zhang, Shashwat Saxena +2 more
The paper introduces Terminal Wrench, a comprehensive dataset of 331 reward-hackable terminal-agent environments and 3,632 exploit trajectories, demonstrating that detection of reward hacking degrades…
Huiyu Xu, Zhibo Wang, Wenhui Zhang, Ziqi Zhu +3 more
The paper introduces LoopTrap, an automated red-teaming framework that demonstrates how malicious prompts can poison the termination judgment of LLM agents, causing unbounded computation.
Jianan Ma, Xiaohu Du, Ruixiao Lin, Yaoxiang Bian +7 more
The paper introduces a multi-dimensional evasion framework and a new benchmark (A3S-Bench) to test autonomous agents, demonstrating that stateful, multi-turn attacks significantly increase system risk…
Ismail Hossain, Sai Puppala, Zhuoran Lu, Sajedul Talukder +1 more
The paper introduces SkillVetBench, a novel two-stage benchmark that effectively detects and verifies malicious behavior in open agentic skill ecosystems, significantly outperforming existing static a…
Ismail Hossain, Sai Puppala, Zhuoran Lu, Sajedul Talukder +1 more
The paper introduces SkillVetBench, a novel two-stage benchmark that effectively detects and verifies malicious behavior hidden within open agentic skills, significantly outperforming static and seman…
The paper introduces Phoenix, a training-free multi-agent framework that detects code vulnerabilities by synthesizing project-specific behavioral contracts, significantly outperforming existing method…
Ali Al-Kaswan, Maksim Plotnikov, Maxim Hájek, Roland Vízner +2 more
The paper introduces DeepRed, a new benchmark for evaluating LLM agents in realistic CTF challenges, finding that current agents are limited, achieving only 35% average checkpoint completion.
Hengyu An, Minxi Li, Jinghuai Zhang, Naen Xu +5 more
The paper introduces ACIArena, a unified and comprehensive evaluation framework designed to systematically test the robustness of Multi-Agent Systems against complex Agent Cascading Injection attacks.
Tomer Keren, Nitay Calderon, Asaf Yehudai, Yotam Perlitz +2 more
The paper introduces TASTE, an automatic task synthesis method that generates challenging agent benchmarks by evolving tool sequences, demonstrating that existing benchmarks are saturated and that TAS…
Kevin Eykholt, Dhilung Kirat, Xiaokui Shu, Jiyong Jang +2 more
The paper reports on penetration tests conducted on proprietary, large-scale AI agent systems, finding that security vulnerabilities persist despite stricter development standards.
The paper introduces a novel, practical evaluation protocol that shifts the assessment of AI pentesting agents from simple task completion to validated, open-ended vulnerability discovery in complex,…