~ similar to 2605.09070v1· 19 results
The paper argues that the standard Attack Success Rate (ASR) metric for LLM jailbreaks is unstable and systematically inflated, proposing new frameworks to account for stochasticity in both evaluation…
Xinkai Zhang, Zhipeng Wei, Huanli Gong, Jing Ting Zheng +3 more
The paper introduces MT-JailBench, a modular framework for evaluating multi-turn jailbreaks, demonstrating that controlling experimental components like prompt generation and resource budgets is cruci…
This paper introduces the 'wide-net-casting' jailbreak scenario, demonstrating that querying a group of large language models can expose significant, previously overlooked safety risks, with a novel m…
Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more
This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…
Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more
This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…
The paper introduces a framework using the 'behavioral geometry' of model populations to efficiently predict jailbreak susceptibility and transfer defenses, achieving high accuracy with significantly…
This paper systematically analyzes the interaction of multiple weak jailbreak attacks (mutators) applied sequentially to LLMs, finding that most combinations fail due to destructive interference, reve…
This paper provides a unified taxonomy and controlled empirical evaluation of jailbreak attacks and defenses for Large Audio Language Models (LALMs), demonstrating that safety evaluation must consider…
Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more
The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by shifting the model's internal…
Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more
The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by guiding the model's internal r…
The paper introduces a new adaptive jailbreak attack (JB-GCG) that successfully bypasses the state-of-the-art JBShield defense, and proposes a more robust defense (RTV) based on multi-layer representa…
The paper introduces a novel survival analysis framework to quantify how LLM safety degrades over repeated adversarial attacks, revealing distinct vulnerability profiles among tested models.
Kejia Chen, Jiawen Zhang, Boheng Li, Pengcheng Li +5 more
The paper proposes mitigating the progressive degradation of safety in language models caused by many-shot jailbreak attacks by appending a single, fixed safety demonstration at inference time.
The paper introduces Temporal Logit Observability (TLO), a training-free diagnostic that analyzes the decoding process to reveal the temporal patterns of LLM safety failures, showing that failure mech…
The paper introduces Indirect Harm Optimization (IHO), a novel black-box, adaptive, and efficient attack method that significantly improves jailbreak success rates against LLMs, aiming to provide a st…
Luoyu Chen, Weiqi Wang, Zhiyi Tian, Chenhan Zhang +4 more
The paper proposes an unsupervised bi-level adversarial training framework to enhance LLM safety steering, achieving strong zero-shot defense against unseen and evolving jailbreak prompts.
Yihao Zhang, Kai Wang, Jiangrong Wu, Haolin Wu +6 more
The paper introduces Salami Slicing Risk, a novel multi-turn jailbreak technique that accumulates harmful intent through numerous low-risk inputs, achieving state-of-the-art attack success rates again…
SAFEDREAM introduces a lightweight, external world-model framework that proactively detects multi-turn jailbreak attacks by modeling cumulative safety erosion and predicting early failure points.
The paper introduces a robust evaluation methodology, Gate AI, to accurately benchmark LLM security detectors by eliminating systematic weaknesses like per-dataset threshold tuning and undisclosed ope…