~ similar to 2604.19354v2· 20 results
Youness Bouchari, Matteo Boffa, Marco Mellia, Idilio Drago +2 more
The paper re-evaluates LLM agents on CTFs, finding that while general-purpose agents like claude-code are strong baselines, specialized, modular architectures significantly improve performance and con…
The paper introduces CTFusion, a novel streaming evaluation framework built on Live CTFs, to provide a robust and reliable benchmark for assessing LLM agents in cybersecurity tasks.
The paper evaluates Language Model Agents (LMAs) for red-teaming by benchmarking their ability to perform lateral movement, finding that expert-defined action plans are most effective, though all moda…
Red-MIRROR is a novel multi-agent LLM system that automates complex web penetration testing by integrating a memory-reflection backbone, achieving superior performance on industry benchmarks.
The paper introduces STRIATUM-CTF, a modular agentic framework that uses a standardized context protocol to enable LLMs to perform multi-step, stateful reasoning for general-purpose CTF solving, achie…
The paper introduces a novel, practical evaluation protocol that shifts the assessment of AI pentesting agents from simple task completion to validated, open-ended vulnerability discovery in complex,…
Huiyu Xu, Zhibo Wang, Wenhui Zhang, Ziqi Zhu +3 more
The paper introduces LoopTrap, an automated red-teaming framework that demonstrates how malicious prompts can poison the termination judgment of LLM agents, causing unbounded computation.
This study provides a comprehensive benchmark of 10 frontier LLMs on 200 offensive cybersecurity tasks, finding that environment tooling and model selection are the primary performance drivers, with C…
The paper introduces ExploitBench, a capability-graded benchmark that measures the progressive stages of exploitation, demonstrating that while current frontier models can easily trigger bugs, achievi…
The paper systematically maps LLM agent vulnerabilities by testing 10,000 prompt variations, finding that 'goal reframing' language is the primary trigger for exploitation, rather than broad adversari…
Hao Wang, Hanchen Li, Qiuyang Mang, Alvin Cheung +2 more
The paper introduces BenchJack, an automated red-teaming system that systematically audits popular AI agent benchmarks, revealing numerous reward-hacking exploits and demonstrating a method to signifi…
The paper benchmarks current frontier computer-using agents against hand-crafted attacks, finding that while they are highly safe in browser tasks, this safety does not generalize to other domains lik…
The paper proposes a general-purpose pipeline to train automated red teaming models capable of generating attacks for arbitrary adversarial goals, overcoming the limitations of current methods that ar…
The paper proposes the Layered Attack Surface Model (LASM), a structural taxonomy that maps security threats and defenses across the complex, multi-layered architecture of AI agents, revealing signifi…
The paper proposes Dynamic Cyber Ranges, an advanced cyber range environment using LLM-driven Defender agents to counter the saturation of traditional security benchmarks, demonstrating that these dyn…
The paper introduces a challenging benchmark for LLM agents to perform unsupervised threat hunting on raw Windows event logs, finding that current frontier models perform poorly and are not ready for…
Hammad Atta, Ken Huang, Kyriakos Rock Lambros, Yasir Mehmood +10 more
The paper introduces LAAF, a novel automated red-teaming framework, to systematically test and exploit Logic-layer Prompt Control Injection (LPCI) vulnerabilities in complex agentic LLM systems.
The paper introduces an AI red teaming agent that drastically reduces the time and effort required for security testing by allowing operators to define complex attack goals using natural language, com…
The paper introduces a novel framework to evaluate when and how AI agents should refuse harmful requests in offensive cybersecurity tasks, finding that most state-of-the-art models exhibit dangerously…
Chiyu Zhang, Huiqin Yang, Bendong Jiang, Xiaolei Zhang +7 more
The paper introduces LITMUS, a novel benchmark that rigorously tests LLM agents for dangerous, physical-layer behavioral jailbreaks in real OS environments, revealing that current agents frequently ex…