Papers similar to 2606.02822v1

~ similar to 2606.02822v1· 20 results

cs.CRcs.AIRecentMar 17, 2026

Security Assessment and Mitigation Strategies for Large Language Models: A Comprehensive Defensive Framework

The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…

View →

cs.CRcs.AIRecentMay 28, 2026

How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency

Galip Tolga Erdem

This study empirically measures the consistency and success rate of autonomous LLM penetration testing across multiple services, finding statistically significant differences in exploitation capabilit…

View →

cs.CRcs.AIRecentMay 28, 2026

How Reliable Are AI Attackers Against a Fixed Vulnerable Target? A 400-Run Empirical Study of LLM Penetration Testing Consistency

Galip Tolga Erdem

This study empirically measures the consistency and effectiveness of autonomous LLM penetration testing across multiple services, finding statistically significant differences in exploitation rates am…

View →

cs.CRcs.AIcs.LGRecentMay 8, 2026

Defense effectiveness across architectural layers: a mechanistic evaluation of persistent memory attacks on stateful LLM agents

Jun Wen Leong

The paper systematically evaluates various defense mechanisms against persistent memory attacks on LLM agents, finding that only tool-gating at the memory layer (Memory Sandbox) effectively mitigates…

View →

cs.CRcs.CLRecentMay 14, 2026

Talk is (Not) Cheap: A Taxonomy and Benchmark Coverage Audit for LLM Attacks

Karthik Raghu Iyer, Yazdan Jamshidi, Nicholas Bray, Alexey A. Shvets

The paper introduces a comprehensive taxonomy and auditing framework to assess the collective coverage of existing LLM attack benchmarks, revealing significant and systematic gaps in current testing m…

View →

cs.CRcs.AIRecentMay 19, 2026

Measuring Safety Alignment Effects in Autonomous Security Agents

Isaac David, Arthur Gervais

The study evaluates how safety alignment affects autonomous security agents using a comprehensive trace-based benchmark, finding that while less-restricted models show gains, these effects are not uni…

View →

cs.CRcs.AIcs.SERecentJun 3, 2026

Willing but Unable: Separating Refusal from Capability in Code LLMs via Abliteration

Cristina Carleo, Pietro Liguori, Naghmeh Ivaki, Domenico Cotroneo

The paper introduces 'abliteration,' a weight editing technique that successfully bypasses the refusal mechanism of safety-aligned Code LLMs, enabling scalable synthesis of vulnerable code from safe i…

View →

cs.CRcs.AIRecentMay 17, 2026

Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications

Isaac David, Arthur Gervais

The paper proposes Ablating Safety, a controlled protocol for removing safety alignment from language models, demonstrating that targeted de-alignment can significantly boost security performance whil…

View →

cs.CRcs.AIRecentApr 21, 2026

Cyber Defense Benchmark: Agentic Threat Hunting Evaluation for LLMs in SecOps

Alankrit Chona, Igor Kozlov, Ambuj Kumar

The paper introduces a challenging benchmark for LLM agents to perform unsupervised threat hunting on raw Windows event logs, finding that current frontier models perform poorly and are not ready for…

View →

cs.CRcs.AIcs.CLRecentApr 7, 2026

Swiss-Bench 003: Evaluating LLM Reliability and Adversarial Security for Swiss Regulatory Contexts

Fatih Uenal

This paper introduces Swiss-Bench 003, an expanded evaluation framework assessing LLM reliability and adversarial security across eight dimensions using 808 Swiss-specific items, revealing that self-g…

View →

cs.CRcs.AIRecentApr 1, 2026

Automated Framework to Evaluate and Harden LLM System Instructions against Encoding Attacks

Anubhab Sahu, Diptisha Samanta, Reza Soosahabi

The paper introduces an automated framework demonstrating that LLM system instructions are vulnerable to encoding attacks, where structured output requests can bypass safety refusals and leak sensitiv…

View →

cs.CRcs.AIcs.SERecentApr 21, 2026

Refute-or-Promote: An Adversarial Stage-Gated Multi-Agent Review Methodology for High-Precision LLM-Assisted Defect Discovery

Abhinav Agarwal

The paper introduces Refute-or-Promote, an adversarial multi-agent review system that significantly improves the precision of LLM-assisted defect discovery by filtering out false positives.

View →

cs.CRRecentMay 14, 2026

Defenses at Odds: Measuring and Explaining Defense Conflicts in Large Language Models

Xiangtao Meng, Wenyu Chen, Chuanchao Zang, Xinyu Gao +4 more

This paper systematically measures and explains how sequential model defenses can conflict, finding that 38.9% of ordered defense sequences cause measurable risk exacerbation due to anti-aligned param…

View →

cs.CRcs.SERecentMay 4, 2026

A Validated Prompt Bank for Malicious Code Generation: Separating Executable Weapons from Security Knowledge in 1,554 Consensus-Labeled Prompts

Richard J. Young, Gregory D. Moody

The paper introduces a validated, consensus-labeled prompt bank that separates requests for executable malicious code (weapons) from requests for general harmful security knowledge, providing a more g…

View →

cs.CRcs.MARecentJun 4, 2026

ZERO-APT: A Closed-Loop Adversarial Framework for LLM-Driven Automated Penetration Testing under Intelligent Defense

Anlan Zheng, Tiantian Zhu

ZERO-APT introduces a novel closed-loop adversarial framework for automated penetration testing that simulates attacks against an intelligent, real-time defending system, achieving a high attack succe…

View →

cs.CRRecentMay 20, 2026

Rethinking Fraud Safety Evaluation: Multi-Round Attacks Reveal Safety-Utility Tradeoffs in Graph-Context LLM Defenders

Laura Jiang, Reza Ryan, Qian Li, Nasim Ferdosian

The paper evaluates graph-context LLM defenders against multi-round, adaptive fraud attacks, finding that while graph context improves early safety, it significantly increases benign over-refusal due…

View →

cs.CRRecentApr 19, 2026

GuardPhish: Securing Open-Source LLMs from Phishing Abuse

Rina Mishra, Gaurav Varshney, Doddipatla Sesha Sahithi

The paper introduces GuardPhish, a large-scale dataset and evaluation framework, demonstrating that even high-performing open-source LLMs can generate actionable phishing content despite accurate inte…

View →

cs.CRRecentApr 18, 2026

False Security Confidence in Benign LLM Code Generation

Xiaolei Ren

The paper introduces False Security Confidence (FSC), a new metric to measure the inherent prevalence of security vulnerabilities in code generated by LLMs that are otherwise functionally correct, eve…

View →

cs.CRcs.AIRecentMay 31, 2026

A New Framework for Cybersecurity Refusals in AI Agents

Eliot Krzysztof Jones, Mateusz Dziemian, Matt Fredrikson, J Zico Kolter

The paper introduces a novel framework to evaluate when and how AI agents should refuse harmful requests in offensive cybersecurity tasks, finding that most state-of-the-art models exhibit dangerously…

View →

cs.CRRecentMay 6, 2026

Sealing the Audit-Runtime Gap for LLM Skills

Tingda Shen, Yebo Feng, Konglin Zhu, Xiaojun Jia +2 more

The paper introduces SIGIL, a novel framework that cryptographically seals the entire lifecycle of LLM skills, ensuring verifiable integrity from publication through runtime execution to prevent suppl…

View →