~ similar to 2606.00801· 20 results
The paper introduces a quality-diversity evolutionary framework that evolves interpretable attack strategies, successfully discovering distinct and systematic vulnerabilities in major LLMs like GPT-4o…
The paper proposes a general-purpose pipeline to train automated red teaming models capable of generating attacks for arbitrary adversarial goals, overcoming the limitations of current methods that ar…
Xiaozhe Zhang, Chaozhuo Li, Hui Liu, Shaocheng Yan +3 more
The EvoSafety framework enhances LLM safety by externalizing attack and defense mechanisms, enabling persistent, transferable, and model-agnostic robustness against adversarial prompts.
The study demonstrates that LLM safety alignment is non-monotonic across model generations, showing that Gemma 3 exhibits unexpectedly high vulnerability to adversarial attacks compared to both its pr…
The study demonstrates that safety alignment in LLMs is non-monotonic across model generations, showing that Gemma 3 exhibits a significantly higher attack success rate than both its predecessor and s…
Hyomin Lee, Sangwoo Park, Yumin Choi, Sohyun An +2 more
The paper introduces T-MAP, a trajectory-aware evolutionary search method, to discover and generate multi-step adversarial prompts that exploit vulnerabilities in autonomous LLM agents through tool ex…
This paper quantifies the polymorphic capacity of a commercial LLM, demonstrating that it can cheaply generate large populations of structurally diverse, yet behaviorally equivalent, offensive code pa…
Hammad Atta, Ken Huang, Kyriakos Rock Lambros, Yasir Mehmood +10 more
The paper introduces LAAF, a novel automated red-teaming framework, to systematically test and exploit Logic-layer Prompt Control Injection (LPCI) vulnerabilities in complex agentic LLM systems.
The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…
Youness Bouchari, Matteo Boffa, Marco Mellia, Idilio Drago +2 more
The paper re-evaluates LLM agents on CTFs, finding that while general-purpose agents like claude-code are strong baselines, specialized, modular architectures significantly improve performance and con…
This study empirically measures the consistency and success rate of autonomous LLM penetration testing across multiple services, finding statistically significant differences in exploitation capabilit…
This study empirically measures the consistency and effectiveness of autonomous LLM penetration testing across multiple services, finding statistically significant differences in exploitation rates am…
The paper identifies a critical vulnerability, the Camouflage Detection Gap (CDG), where standard LLM injection detectors fail dramatically when malicious payloads mimic the target domain's language a…
The paper benchmarks current frontier computer-using agents against hand-crafted attacks, finding that while they are highly safe in browser tasks, this safety does not generalize to other domains lik…
This paper identifies the 'Format-Reliability Gap'—where LLMs know about code vulnerabilities but generate insecure code anyway—and proposes a localized, per-vulnerability steering vector fix that sig…
Jiaqing Li, Zhibo Zhang, Shide Zhou, Yuxi Li +2 more
The paper introduces TrojanMerge, a framework demonstrating that model merging can be exploited to systematically compromise the safety alignment of multiple individually safe LLMs.
The paper introduces an AI red teaming agent that drastically reduces the time and effort required for security testing by allowing operators to define complex attack goals using natural language, com…
Jiacheng Liang, Yao Ma, Tharindu Kumarage, Satyapriya Krishna +4 more
ARES is a novel framework that systematically discovers and mitigates dual vulnerabilities in RLHF systems by simultaneously testing the core LLM and its Reward Model (RM) using structured adversarial…
Houjun Liu, Lisa Einstein, John Yang, Joachim Baumann +4 more
SecureForge is an automated pipeline that significantly reduces cybersecurity vulnerabilities in LLM-generated code by optimizing system prompts, achieving up to a 48% reduction in output vulnerabilit…
ZERO-APT introduces a novel closed-loop adversarial framework for automated penetration testing that simulates attacks against an intelligent, real-time defending system, achieving a high attack succe…