~ similar to 2603.29062v1· 20 results
The paper introduces CSTM-Bench, a comprehensive benchmark and evaluation framework demonstrating that standard session-bound AI guardrails fail against sophisticated, cross-session attacks that accum…
Xinjie Shen, Rongzhe Wei, Peizhi Niu, Haoyu Wang +5 more
The paper introduces TurnGate, a response-aware defense mechanism that detects the earliest turn in a multi-turn dialogue where the accumulated interaction enables a harmful action, significantly impr…
Siyuan Li, Zehao Liu, Xi Lin, Qinghua Mao +5 more
CoopGuard is a novel stateful, multi-round defense framework using cooperative agents to significantly reduce the success rate of evolving adversarial attacks against Large Language Models.
This paper introduces MultiTurnPSB, a multi-turn adversarial benchmark, demonstrating that the safety of medical AI chatbots degrades significantly under sustained, real-world adversarial prompting, r…
The paper introduces Transient Turn Injection (TTI), a novel multi-turn attack technique that exploits stateless moderation in LLMs by distributing adversarial intent across isolated interactions, rev…
The paper proposes the Layered Attack Surface Model (LASM), a structural taxonomy that maps security threats and defenses across the complex, multi-layered architecture of AI agents, revealing signifi…
The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…
AttackEval systematically evaluates the effectiveness of 250 prompt injection prompts across ten attack categories, finding that composite and obfuscation attacks are highly effective against current…
Yunhao Feng, Xiaohu Du, Xinhao Deng, Yifan Ding +12 more
BraveGuard is a self-evolving defense framework that significantly improves the safety monitoring of computer-use agents by generating guard model supervision from open-world threat discovery and real…
Yunhao Feng, Yifan Ding, Xiaohu Du, Ming Wen +12 more
BraveGuard is a self-evolving defense framework that improves the safety of computer-use agents by training guard models on open-world, multi-step threat trajectories rather than static benchmarks.
The paper introduces 'adversarial restlessness,' an activation-level signature in LLM residual streams, to detect multi-turn prompt injection attacks with high accuracy.
The Cognitive Firewall is a hybrid edge-cloud defense architecture that significantly reduces the attack success rate of Indirect Prompt Injection against browser-based AI agents by combining local vi…
The paper introduces a challenging benchmark for LLM agents to perform unsupervised threat hunting on raw Windows event logs, finding that current frontier models perform poorly and are not ready for…
enclawed is a configurable, hard-fork hardening framework for AI assistant gateways that enforces strict security controls, verifiable trust, and auditable connectivity for regulated environments.
Bowen Sun, Chaozhuo Li, Yaodong Yang, Yiwei Wang +1 more
TwinGate introduces a stateful dual-encoder defense framework using Asymmetric Contrastive Learning to detect malicious intent from fragmented, untraceable LLM queries with high recall and low false p…
Jianan Ma, Xiaohu Du, Ruixiao Lin, Yaoxiang Bian +7 more
The paper introduces a multi-dimensional evasion framework and a new benchmark (A3S-Bench) to test autonomous agents, demonstrating that stateful, multi-turn attacks significantly increase system risk…
The paper identifies a critical vulnerability, the Camouflage Detection Gap (CDG), where standard LLM injection detectors fail dramatically when malicious payloads mimic the target domain's language a…
ClawGuard is a novel runtime security framework that deterministically enforces user-confirmed rules at tool-call boundaries to protect LLM agents from indirect prompt injection.
Tri Cao, Yulin Chen, Hieu Cao, Yibo Li +7 more
The paper proposes WARD, a robust and efficient defense model that secures web agents against prompt injection attacks embedded in web content, achieving high recall and low false positives even again…
Paulo Ricardo Ferreira Neves, Edson Rodrigues da Cruz Filho, Paulo Henrique Eleuterio Falsetti, João Vitor Pavan +6 more
GuardNet is a lightweight, ensemble-based guardrail system using shallow neural networks that provides robust and efficient detection of Prompt Injection and Jailbreak attacks on LLMs, suitable for pr…