~ similar to 2603.17239v1· 20 results
The paper evaluates Language Model Agents (LMAs) for red-teaming by benchmarking their ability to perform lateral movement, finding that expert-defined action plans are most effective, though all moda…
The paper proposes a general-purpose pipeline to train automated red teaming models capable of generating attacks for arbitrary adversarial goals, overcoming the limitations of current methods that ar…
The paper introduces an AI red teaming agent that drastically reduces the time and effort required for security testing by allowing operators to define complex attack goals using natural language, com…
Red-MIRROR is a novel multi-agent LLM system that automates complex web penetration testing by integrating a memory-reflection backbone, achieving superior performance on industry benchmarks.
The paper proposes the Layered Attack Surface Model (LASM), a structural taxonomy that maps security threats and defenses across the complex, multi-layered architecture of AI agents, revealing signifi…
The paper proposes an autonomous red teaming framework combining LLMs and RL to generate sophisticated, multi-stage cyber attack campaigns, demonstrating its necessity for evaluating robust AI-enabled…
AutoRISE proposes optimizing the entire attack strategy—by searching over executable programs—rather than just optimizing prompts, achieving significant improvements in red-teaming large language mode…
Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu +3 more
This paper introduces ClawTrojan, a benchmark for multi-step trojan attacks against LLM agents, and proposes DASGuard, a dynamic defense mechanism that traces and sanitizes untrusted control content i…
Jiejun Tan, Zhicheng Dou, Xinyu Yang, Yuyang Hu +3 more
The paper introduces ClawTrojan, a benchmark for multi-step trojan attacks against LLM agents, and proposes DASGuard, a defense mechanism that detects and sanitizes backdoor content planted across mul…
The paper introduces MonitoringBench, a semi-automated red-teaming methodology that generates diverse and stronger attacks, revealing that current coding-agent monitors often fail against sophisticate…
The paper systematically evaluates various defense mechanisms against persistent memory attacks on LLM agents, finding that only tool-gating at the memory layer (Memory Sandbox) effectively mitigates…
The paper introduces FlashRT, a novel framework that significantly improves the computational and memory efficiency of optimization-based red-teaming attacks against long-context LLMs, enabling system…
This study empirically measures the consistency and success rate of autonomous LLM penetration testing across multiple services, finding statistically significant differences in exploitation capabilit…
This study empirically measures the consistency and effectiveness of autonomous LLM penetration testing across multiple services, finding statistically significant differences in exploitation rates am…
This paper systematically maps the expanded attack surface of agentic AI systems, identifying new threat vectors like RAG poisoning and cross-agent manipulation, and proposes a comprehensive security…
Huiyu Xu, Zhibo Wang, Wenhui Zhang, Ziqi Zhu +3 more
The paper introduces LoopTrap, an automated red-teaming framework that demonstrates how malicious prompts can poison the termination judgment of LLM agents, causing unbounded computation.
Hyomin Lee, Sangwoo Park, Yumin Choi, Sohyun An +2 more
The paper introduces T-MAP, a trajectory-aware evolutionary search method, to discover and generate multi-step adversarial prompts that exploit vulnerabilities in autonomous LLM agents through tool ex…
Automation-Exploit is a multi-agent LLM framework that enables adaptive offensive security by using a digital twin to safely test and execute high-risk memory-corruption exploits on live targets.
The paper introduces an automated framework demonstrating that LLM system instructions are vulnerable to encoding attacks, where structured output requests can bypass safety refusals and leak sensitiv…
The paper introduces a quality-diversity evolutionary framework that evolves interpretable attack strategies, successfully discovering distinct and systematic vulnerabilities in major LLMs like GPT-4o…