Similar to 2605.08277v1 · 20 results
This paper systematically analyzes the interaction of multiple weak jailbreak attacks (mutators) applied sequentially to LLMs, finding that most combinations fail due to destructive interference, reve…
Xinkai Zhang, Zhipeng Wei, Huanli Gong, Jing Ting Zheng +3 more
The paper introduces MT-JailBench, a modular framework for evaluating multi-turn jailbreaks, demonstrating that controlling experimental components like prompt generation and resource budgets is cruci…
Huanli Gong, Zhipeng Wei, Yu Fu, Haz Sameen Shahgir +3 more
D-Judge introduces a semantics-preserving output rewriting defense that disrupts multi-turn jailbreak attacks by misaligning the feedback signal used by an attacker's judge model.
Jindong Li, Ying Liu, Yali Fu, Jinjing Zhu +3 more
The paper proposes SRTJ, a Self-Evolving Rule-Driven Training-Free Jailbreak framework that systematically discovers and refines attack strategies using rule composition and feedback to achieve robust…
The paper introduces Incremental Completion Decomposition (ICD), a novel jailbreak strategy that successfully bypasses LLM safety mechanisms by eliciting malicious content through a sequence of single…
Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more
This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…
Luoyu Chen, Weiqi Wang, Zhiyi Tian, Chenhan Zhang +4 more
The paper proposes an unsupervised bi-level adversarial training framework to enhance LLM safety steering, achieving strong zero-shot defense against unseen and evolving jailbreak prompts.
Luoyu Chen, Weiqi Wang, Zhiyi Tian, Feng Wu +2 more
The paper proposes Ellipsoid Control, a white-list defense mechanism that uses benign data geometry to constrain model updates, thereby enhancing jailbreak safety while preserving the utility of harml…
Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more
This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…
This paper introduces the 'wide-net-casting' jailbreak scenario, demonstrating that querying a group of large language models can expose significant, previously overlooked safety risks, with a novel m…
Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi +1 more
The paper proposes a novel pre-model safeguard that uses small draft models (SLMs) to predict the safety of prompts, significantly reducing false-negative rates while maintaining low computational ove…
The paper introduces a new adaptive jailbreak attack (JB-GCG) that successfully bypasses the state-of-the-art JBShield defense, and proposes a more robust defense (RTV) based on multi-layer representa…
Zheng Lin, Zhenxing Niu, Haoxuan Ji, Yuzhe Huang +1 more
The paper introduces an embedding disruption method to re-activate and strengthen built-in safeguards within LLMs, effectively detecting and defending against sophisticated jailbreak attacks.
The paper introduces a novel multi-turn jailbreaking method that exploits the vulnerability of safe completion models by gradually building conversational trust, and it also uncovers a new vulnerabili…
Yihao Zhang, Kai Wang, Jiangrong Wu, Haolin Wu +6 more
The paper introduces Salami Slicing Risk, a novel multi-turn jailbreak technique that accumulates harmful intent through numerous low-risk inputs, achieving state-of-the-art attack success rates again…
This paper argues that reporting only the best-case attack success rate for jailbreaks is insufficient, proposing new distributional metrics (VSM and UC) to better characterize the true threat posed b…
Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more
The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by shifting the model's internal…
Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more
The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by guiding the model's internal r…
SAFEDREAM introduces a lightweight, external world-model framework that proactively detects multi-turn jailbreak attacks by modeling cumulative safety erosion and predicting early failure points.
The paper introduces Persona Attack, a novel memory injection jailbreak method that demonstrates that accumulating instructions in the model's context window can override internal safety alignments, a…