~ similar to 2606.03136v1· 20 results
The paper introduces 'adversarial restlessness,' an activation-level signature in LLM residual streams, to detect multi-turn prompt injection attacks with high accuracy.
Xinjie Shen, Rongzhe Wei, Peizhi Niu, Haoyu Wang +5 more
The paper introduces TurnGate, a response-aware defense mechanism that detects the earliest turn in a multi-turn dialogue where the accumulated interaction enables a harmful action, significantly impr…
Zhiqing Ma, Zhonghao Xu, Dong Yu, Chen Kang +2 more
THRD introduces a novel, training-free framework that models temporal risk accumulation to effectively defend against multi-turn jailbreak attacks on LLMs, significantly reducing attack success rates…
The paper introduces a framework using the 'behavioral geometry' of model populations to efficiently predict jailbreak susceptibility and transfer defenses, achieving high accuracy with significantly…
This paper introduces MultiTurnPSB, a multi-turn adversarial benchmark, demonstrating that the safety of medical AI chatbots degrades significantly under sustained, real-world adversarial prompting, r…
The paper introduces THREAT, a novel reasoning-driven framework that efficiently discovers highly effective and targeted jailbreak prompts for LLMs, revealing previously unknown safety vulnerabilities…
Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more
This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…
The paper introduces a new adaptive jailbreak attack (JB-GCG) that successfully bypasses the state-of-the-art JBShield defense, and proposes a more robust defense (RTV) based on multi-layer representa…
Xinkai Zhang, Zhipeng Wei, Huanli Gong, Jing Ting Zheng +3 more
The paper introduces MT-JailBench, a modular framework for evaluating multi-turn jailbreaks, demonstrating that controlling experimental components like prompt generation and resource budgets is cruci…
The paper introduces Multi-Clip Video (MCV) SafetyBench, a dataset demonstrating that the vulnerability of Multimodal Large Language Models (MLLMs) to jailbreaking increases with the diversity and num…
Bowen Sun, Chaozhuo Li, Yaodong Yang, Yiwei Wang +1 more
TwinGate introduces a stateful dual-encoder defense framework using Asymmetric Contrastive Learning to detect malicious intent from fragmented, untraceable LLM queries with high recall and low false p…
The paper introduces a novel multi-turn jailbreaking method that exploits the vulnerability of safe completion models by gradually building conversational trust, and it also uncovers a new vulnerabili…
Yuanbo Zhou, Changjia Zhu, Junyu Wang, Xu He +4 more
The paper introduces the Prompt Overflow Attack, demonstrating that guardrail models inspecting truncated or segmented inputs fail to detect malicious instructions that are only actionable when the fu…
Paulo Ricardo Ferreira Neves, Edson Rodrigues da Cruz Filho, Paulo Henrique Eleuterio Falsetti, João Vitor Pavan +6 more
GuardNet is a lightweight, ensemble-based guardrail system using shallow neural networks that provides robust and efficient detection of Prompt Injection and Jailbreak attacks on LLMs, suitable for pr…
AttackEval systematically evaluates the effectiveness of 250 prompt injection prompts across ten attack categories, finding that composite and obfuscation attacks are highly effective against current…
ContextualJailbreak introduces an evolutionary red-teaming strategy that performs automated search over simulated multi-turn primed dialogues, achieving high jailbreak rates across multiple state-of-t…
The paper proposes the Triple-tier Anomaly Defense (TRIAD) framework, a predictive model that treats safety verification as a dynamic trajectory problem to detect cumulative, cross-modal poisoning in…
The paper introduces TraceSafe-Bench, a comprehensive benchmark, and finds that securing LLM agents requires jointly optimizing for structural reasoning and safety alignment to mitigate risks during m…
Lixing Lin, Juli You, Yue Li, Luyun Lin +3 more
Reflect-Guard enhances LLM safety classifiers by integrating logical self-reflection, significantly improving detection of sophisticated adversarial jailbreak prompts.
The paper introduces Probe-Geometry Alignment (PGA), a surgical method that removes the measurable cross-sequence memorization signature from large language models without degrading their general capa…