~ similar to 2605.09684v1· 20 results
The paper introduces an AI red teaming agent that drastically reduces the time and effort required for security testing by allowing operators to define complex attack goals using natural language, com…
Elle Najt, Colin Toft, Tyler Tracy, Fabien Roger +1 more
The paper introduces SLEIGHT-Bench, a benchmark of 40 synthetic attacks, demonstrating that current LLM monitor systems fail to detect a significant number of covert, harmful actions executed by codin…
Zenghao Duan, Yuxin Tian, Zhiyi Yin, Liang Pang +5 more
SkillAttack is a red-teaming framework that dynamically tests the exploitability of latent vulnerabilities in LLM agent skills using adversarial prompting, demonstrating that even benign skills pose s…
Davis Brown, Samarth Bhargav, Arav Santhanam, Kasper Hong +6 more
The paper introduces a novel stateful online monitoring system that detects distributed multi-agent cyberattacks by aggregating weak suspiciousness signals across many user accounts, overcoming the bl…
Davis Brown, Samarth Bhargav, Arav Santhanam, Kasper Hong +6 more
The paper introduces a novel stateful online monitoring system that detects distributed multi-agent cyberattacks by aggregating weak suspiciousness signals across many user accounts, overcoming the bl…
The paper introduces Proteus, a self-evolving red-team framework that measures the adaptive leakage risk of LLM agent skills, demonstrating that current vetting methods significantly underestimate res…
The paper evaluates Language Model Agents (LMAs) for red-teaming by benchmarking their ability to perform lateral movement, finding that expert-defined action plans are most effective, though all moda…
The paper introduces ClawTrap, a MITM-based red-teaming framework, to evaluate the security robustness of web agents like OpenClaw against dynamic, real-world network attacks, finding that model stren…
Hammad Atta, Ken Huang, Kyriakos Rock Lambros, Yasir Mehmood +10 more
The paper introduces LAAF, a novel automated red-teaming framework, to systematically test and exploit Logic-layer Prompt Control Injection (LPCI) vulnerabilities in complex agentic LLM systems.
AutoRISE proposes optimizing the entire attack strategy—by searching over executable programs—rather than just optimizing prompts, achieving significant improvements in red-teaming large language mode…
Huiyu Xu, Zhibo Wang, Wenhui Zhang, Ziqi Zhu +3 more
The paper introduces LoopTrap, an automated red-teaming framework that demonstrates how malicious prompts can poison the termination judgment of LLM agents, causing unbounded computation.
The paper proposes an autonomous red teaming framework combining LLMs and RL to generate sophisticated, multi-stage cyber attack campaigns, demonstrating its necessity for evaluating robust AI-enabled…
The paper proposes a general-purpose pipeline to train automated red teaming models capable of generating attacks for arbitrary adversarial goals, overcoming the limitations of current methods that ar…
Chia-Pei, Chen, Kentaroh Toyoda, Anita Lai +1 more
The paper introduces IPI-proxy, an open-source intercepting proxy toolkit designed to red-team web-browsing AI agents by injecting adversarial payloads into live HTTP responses from whitelisted domain…
TraceGuard introduces a structured, multi-dimensional monitoring protocol that significantly improves the detection of subtle attacks in AI agents while maintaining collusion resistance.
Hyomin Lee, Sangwoo Park, Yumin Choi, Sohyun An +2 more
The paper introduces T-MAP, a trajectory-aware evolutionary search method, to discover and generate multi-step adversarial prompts that exploit vulnerabilities in autonomous LLM agents through tool ex…
The paper introduces AGENTREDBENCH, a dynamic redteaming benchmark that significantly measures indirect prompt injection threats in LLM agents using third-party integrations, and releases AGENTREDGUAR…
The paper introduces AGENTREDBENCH, a dynamic redteaming benchmark that significantly measures indirect prompt injection threats in LLM agents using SaaS integrations, and releases AGENTREDGUARD, a su…
The paper benchmarks current frontier computer-using agents against hand-crafted attacks, finding that while they are highly safe in browser tasks, this safety does not generalize to other domains lik…
Yuting Ning, Zhehao Zhang, Yash Kumar Lal, Boyu Gou +7 more
The paper introduces SkillHarm, a comprehensive benchmark and automated framework for evaluating skill-based attacks across the entire agent skill-use lifecycle, demonstrating that current agents rema…