~ similar to 2606.02822v1· 20 results
The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…
This study empirically measures the consistency and success rate of autonomous LLM penetration testing across multiple services, finding statistically significant differences in exploitation capabilit…
This study empirically measures the consistency and effectiveness of autonomous LLM penetration testing across multiple services, finding statistically significant differences in exploitation rates am…
The paper systematically evaluates various defense mechanisms against persistent memory attacks on LLM agents, finding that only tool-gating at the memory layer (Memory Sandbox) effectively mitigates…
The paper introduces a comprehensive taxonomy and auditing framework to assess the collective coverage of existing LLM attack benchmarks, revealing significant and systematic gaps in current testing m…
The study evaluates how safety alignment affects autonomous security agents using a comprehensive trace-based benchmark, finding that while less-restricted models show gains, these effects are not uni…
The paper introduces 'abliteration,' a weight editing technique that successfully bypasses the refusal mechanism of safety-aligned Code LLMs, enabling scalable synthesis of vulnerable code from safe i…
The paper proposes Ablating Safety, a controlled protocol for removing safety alignment from language models, demonstrating that targeted de-alignment can significantly boost security performance whil…
The paper introduces a challenging benchmark for LLM agents to perform unsupervised threat hunting on raw Windows event logs, finding that current frontier models perform poorly and are not ready for…
This paper introduces Swiss-Bench 003, an expanded evaluation framework assessing LLM reliability and adversarial security across eight dimensions using 808 Swiss-specific items, revealing that self-g…
The paper introduces an automated framework demonstrating that LLM system instructions are vulnerable to encoding attacks, where structured output requests can bypass safety refusals and leak sensitiv…
The paper introduces Refute-or-Promote, an adversarial multi-agent review system that significantly improves the precision of LLM-assisted defect discovery by filtering out false positives.
Xiangtao Meng, Wenyu Chen, Chuanchao Zang, Xinyu Gao +4 more
This paper systematically measures and explains how sequential model defenses can conflict, finding that 38.9% of ordered defense sequences cause measurable risk exacerbation due to anti-aligned param…
The paper introduces a validated, consensus-labeled prompt bank that separates requests for executable malicious code (weapons) from requests for general harmful security knowledge, providing a more g…
ZERO-APT introduces a novel closed-loop adversarial framework for automated penetration testing that simulates attacks against an intelligent, real-time defending system, achieving a high attack succe…
The paper evaluates graph-context LLM defenders against multi-round, adaptive fraud attacks, finding that while graph context improves early safety, it significantly increases benign over-refusal due…
The paper introduces GuardPhish, a large-scale dataset and evaluation framework, demonstrating that even high-performing open-source LLMs can generate actionable phishing content despite accurate inte…
The paper introduces False Security Confidence (FSC), a new metric to measure the inherent prevalence of security vulnerabilities in code generated by LLMs that are otherwise functionally correct, eve…
The paper introduces a novel framework to evaluate when and how AI agents should refuse harmful requests in offensive cybersecurity tasks, finding that most state-of-the-art models exhibit dangerously…
Tingda Shen, Yebo Feng, Konglin Zhu, Xiaojun Jia +2 more
The paper introduces SIGIL, a novel framework that cryptographically seals the entire lifecycle of LLM skills, ensuring verifiable integrity from publication through runtime execution to prevent suppl…