~ similar to 2605.23243v1· 20 results
This study provides a comprehensive benchmark of 10 frontier LLMs on 200 offensive cybersecurity tasks, finding that environment tooling and model selection are the primary performance drivers, with C…
The paper benchmarks current frontier computer-using agents against hand-crafted attacks, finding that while they are highly safe in browser tasks, this safety does not generalize to other domains lik…
The paper introduces COBALT, a Z3 SMT-based formal verification engine, to proactively detect arithmetic vulnerabilities (CWE-190/191/195) in the critical infrastructure surrounding frontier AI models…
The paper proposes Dynamic Cyber Ranges, an advanced cyber range environment using LLM-driven Defender agents to counter the saturation of traditional security benchmarks, demonstrating that these dyn…
Houjun Liu, Lisa Einstein, John Yang, Joachim Baumann +4 more
SecureForge is an automated pipeline that significantly reduces cybersecurity vulnerabilities in LLM-generated code by optimizing system prompts, achieving up to a 48% reduction in output vulnerabilit…
The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…
The paper introduces a challenging benchmark for LLM agents to perform unsupervised threat hunting on raw Windows event logs, finding that current frontier models perform poorly and are not ready for…
The paper introduces ExploitBench, a capability-graded benchmark that measures the progressive stages of exploitation, demonstrating that while current frontier models can easily trigger bugs, achievi…
FORGE is a multi-agent system that integrates vulnerability exploitation, prioritization, and detection engineering into a single pipeline, achieving high-fidelity, multi-level exploitation and genera…
The paper proposes a unified closed-loop threat taxonomy to systematically analyze and defend foundation models by explicitly framing the bidirectional security interactions between data and models.
The paper proposes an attestation-aware promotion gate to mitigate supply-chain risks in LLM pipelines by cryptographically verifying and enforcing claims about training and release artifacts before d…
This study empirically measures the consistency and success rate of autonomous LLM penetration testing across multiple services, finding statistically significant differences in exploitation capabilit…
This study empirically measures the consistency and effectiveness of autonomous LLM penetration testing across multiple services, finding statistically significant differences in exploitation rates am…
Hwiwon Lee, Jiawei Liu, Dongjun Kim, Ziqi Zhang +2 more
The paper introduces SEC-bench Pro, a rigorous benchmark for evaluating LLM-based bug hunting on complex software, finding that even advanced agents struggle with long-horizon security tasks.
The paper analyzes the failure modes of current AI containment methods when the agent itself is the adversary, deriving five necessary architectural requirements for durable safety.
The paper introduces HIDBench, a new benchmark for evaluating LLMs' ability to perform host-based intrusion detection using complex, noisy system logs, finding that model performance degrades signific…
The paper introduces SecLens-R, a multi-stakeholder evaluation framework, demonstrating that LLM performance for vulnerability detection varies significantly depending on the specific priorities (e.g.…
Ayush Garg, Sophia Hager, Jacob Montiel, Aditya Tiwari +4 more
RuleForge is an automated system that generates and validates detection rules for web vulnerabilities from structured CVE templates, significantly improving detection accuracy and reducing false posit…
Hanzhi Liu, Chaofan Shou, Hongbo Wen, Yanju Chen +2 more
This paper systematically analyzes the threat posed by malicious third-party API routers in the LLM supply chain, finding that a significant number of routers actively perform payload injection, crede…
The paper introduces CyberCertBench, a new benchmark suite for evaluating LLMs against industry cybersecurity certifications, finding that while frontier models perform well on general knowledge, thei…