~ similar to 2606.05566v1· 19 results
The paper proposes GUARD-SLM, a token activation-based defense mechanism, to enhance the robustness of Small Language Models (SLMs) against various jailbreak attacks by analyzing and filtering malicio…
Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more
This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…
The paper introduces a robust evaluation methodology, Gate AI, to accurately benchmark LLM security detectors by eliminating systematic weaknesses like per-dataset threshold tuning and undisclosed ope…
Lixing Lin, Juli You, Yue Li, Luyun Lin +3 more
Reflect-Guard enhances LLM safety classifiers by integrating logical self-reflection, significantly improving detection of sophisticated adversarial jailbreak prompts.
The paper introduces a new adaptive jailbreak attack (JB-GCG) that successfully bypasses the state-of-the-art JBShield defense, and proposes a more robust defense (RTV) based on multi-layer representa…
Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi +1 more
The paper proposes a novel pre-model safeguard that uses small draft models (SLMs) to predict the safety of prompts, significantly reducing false-negative rates while maintaining low computational ove…
SelfGrader proposes a lightweight, robust guardrail for detecting LLM jailbreaks by formulating the detection problem as a numerical grading task using anchored token-level logits, achieving strong pe…
Yuanbo Zhou, Changjia Zhu, Junyu Wang, Xu He +4 more
The paper introduces the Prompt Overflow Attack, demonstrating that guardrail models inspecting truncated or segmented inputs fail to detect malicious instructions that are only actionable when the fu…
The paper introduces SecureBreak, a manually annotated, safety-oriented dataset designed to help detect harmful outputs from large language models (LLMs) that bypass existing security alignments.
This paper provides a unified taxonomy and controlled empirical evaluation of jailbreak attacks and defenses for Large Audio Language Models (LALMs), demonstrating that safety evaluation must consider…
Zheng Lin, Zhenxing Niu, Haoxuan Ji, Yuzhe Huang +1 more
The paper introduces an embedding disruption method to re-activate and strengthen built-in safeguards within LLMs, effectively detecting and defending against sophisticated jailbreak attacks.
The paper introduces GuardPhish, a large-scale dataset and evaluation framework, demonstrating that even high-performing open-source LLMs can generate actionable phishing content despite accurate inte…
The paper evaluates prompt-injection defenses for educational LLM tutors, demonstrating that optimal security requires balancing adversarial robustness, usability, and latency, and proposing a compreh…
The paper introduces THREAT, a novel reasoning-driven framework that efficiently discovers highly effective and targeted jailbreak prompts for LLMs, revealing previously unknown safety vulnerabilities…
Yingzi Ma, Zhengyue Zhao, Xiaogeng Liu, Minhui Xue +2 more
MaskForge is a novel, adaptive, black-box attack framework that significantly improves jailbreaking diffusion large language models (dLLMs) by treating red-teaming as an optimized search over reusable…
Xinkai Zhang, Zhipeng Wei, Huanli Gong, Jing Ting Zheng +3 more
The paper introduces MT-JailBench, a modular framework for evaluating multi-turn jailbreaks, demonstrating that controlling experimental components like prompt generation and resource budgets is cruci…
Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more
This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…
SAFEDREAM introduces a lightweight, external world-model framework that proactively detects multi-turn jailbreak attacks by modeling cumulative safety erosion and predicting early failure points.
Li Huang, Zhongxin Liu, Yifan Wu, Tao Yin +5 more
DeepGuard introduces a novel multi-layer semantic aggregation framework to enhance secure code generation by collecting vulnerability cues from multiple upper layers of LLMs, significantly improving s…