Similar to 2605.05630v2 · 20 results
The paper introduces Transient Turn Injection (TTI), a novel multi-turn attack technique that exploits stateless moderation in LLMs by distributing adversarial intent across isolated interactions, rev…
Zhiqing Ma, Zhonghao Xu, Dong Yu, Chen Kang +2 more
THRD introduces a novel, training-free framework that models temporal risk accumulation to effectively defend against multi-turn jailbreak attacks on LLMs, significantly reducing attack success rates…
The paper introduces 'adversarial restlessness,' an activation-level signature in LLM residual streams, to detect multi-turn prompt injection attacks with high accuracy.
This paper introduces MultiTurnPSB, a multi-turn adversarial benchmark, demonstrating that the safety of medical AI chatbots degrades significantly under sustained, real-world adversarial prompting, r…
AttackEval systematically evaluates the effectiveness of 250 prompt injection prompts across ten attack categories, finding that composite and obfuscation attacks are highly effective against current…
This paper introduces HarmAmp, a new benchmark for multi-turn harm amplification, and proposes TrajSafe, a proactive monitoring system that significantly reduces harmfulness in LLM interactions while…
The paper proposes the Triple-tier Anomaly Defense (TRIAD) framework, a predictive model that treats safety verification as a dynamic trajectory problem to detect cumulative, cross-modal poisoning in…
CivicShield introduces a novel, seven-layered defense-in-depth framework that significantly enhances the security of government-facing AI chatbots against sophisticated multi-turn adversarial attacks.
Xiaoyu Xu, Minxin Du, Qipeng Xie, Haobin Ke +2 more
The paper identifies 'unintended long-term state poisoning'—a security risk where routine user interactions gradually corrupt an LLM agent's persistent state—and proposes a defense mechanism called St…
Yongxiang Li, Moxin Li, Zhixin Ma, Fengbin Zhu +3 more
This paper introduces the concept of 'Sleeper Attack,' demonstrating that adversarial content can persist across multiple interactions with an LLM agent, posing a more subtle and difficult-to-detect s…
The paper proposes Open-Book Benign Rewriting (OBBR), a novel defense mechanism that uses LLM rewriting with benign samples to neutralize data poisoning attacks against LLMs, significantly improving s…
Bowen Sun, Chaozhuo Li, Yaodong Yang, Yiwei Wang +1 more
TwinGate introduces a stateful dual-encoder defense framework using Asymmetric Contrastive Learning to detect malicious intent from fragmented, untraceable LLM queries with high recall and low false p…
The paper introduces PsychoPass, a framework that analyzes the geometric trajectory of multi-turn conversations in embedding space to detect adversarial intent early, before harmful content is generat…
Yuhui Wang, Tanqiu Jiang, Jiacheng Liang, Charles Fleming +1 more
The paper introduces MAGE, a novel defensive framework that uses a dedicated 'shadow memory' to proactively detect and mitigate long-horizon threats against LLM agents during complex, multi-step inter…
The paper introduces a novel multi-turn jailbreaking method that exploits the vulnerability of safe completion models by gradually building conversational trust, and it also uncovers a new vulnerabili…
Wenjie Xiao, Xuehai Tang, Biyu Zhou, Songlin Hu +1 more
RouteGuard is a novel detector that identifies skill poisoning in LLM agents by monitoring structured internal attention shifts, achieving high detection rates on critical skill-injection attacks.
Kaisheng Fan, Weizhe Zhang, Yishu Gao, Tegawendé F. Bissyandé +1 more
The paper introduces Tail-risk Intrinsic Geometric Smoothing (TIGS), a plug-and-play, inference-time defense that suppresses backdoor attacks on LLMs by structurally smoothing the attention mechanism…
SentGuard introduces a novel sentence-level streaming guardrail that operates in parallel with LLM generation, achieving high detection rates of unsafe content early in the response while maintaining…
The paper introduces THREAT, a novel reasoning-driven framework that efficiently discovers highly effective and targeted jailbreak prompts for LLMs, revealing previously unknown safety vulnerabilities…