~ similar to 2605.26409v1· 20 results
This paper introduces the 'wide-net-casting' jailbreak scenario, demonstrating that querying a group of large language models can expose significant, previously overlooked safety risks, with a novel m…
Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more
This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…
This paper argues that reporting only the best-case attack success rate for jailbreaks is insufficient, proposing new distributional metrics (VSM and UC) to better characterize the true threat posed b…
This paper systematically analyzes the interaction of multiple weak jailbreak attacks (mutators) applied sequentially to LLMs, finding that most combinations fail due to destructive interference, reve…
Xinkai Zhang, Zhipeng Wei, Huanli Gong, Jing Ting Zheng +3 more
The paper introduces MT-JailBench, a modular framework for evaluating multi-turn jailbreaks, demonstrating that controlling experimental components like prompt generation and resource budgets is cruci…
The paper investigates how different methods of jailbreaking large language models (SFT, RLVR, and abliteration) lead to vastly different behavioral and mechanistic failures, even when all methods ach…
Luoyu Chen, Weiqi Wang, Zhiyi Tian, Chenhan Zhang +4 more
The paper proposes an unsupervised bi-level adversarial training framework to enhance LLM safety steering, achieving strong zero-shot defense against unseen and evolving jailbreak prompts.
Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi +1 more
The paper proposes a novel pre-model safeguard that uses small draft models (SLMs) to predict the safety of prompts, significantly reducing false-negative rates while maintaining low computational ove…
The paper argues that the standard Attack Success Rate (ASR) metric for LLM jailbreaks is unstable and systematically inflated, proposing new frameworks to account for stochasticity in both evaluation…
The paper introduces Indirect Harm Optimization (IHO), a novel black-box, adaptive, and efficient attack method that significantly improves jailbreak success rates against LLMs, aiming to provide a st…
The paper introduces a new adaptive jailbreak attack (JB-GCG) that successfully bypasses the state-of-the-art JBShield defense, and proposes a more robust defense (RTV) based on multi-layer representa…
Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more
This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…
Luoyu Chen, Weiqi Wang, Zhiyi Tian, Feng Wu +2 more
The paper proposes Ellipsoid Control, a white-list defense mechanism that uses benign data geometry to constrain model updates, thereby enhancing jailbreak safety while preserving the utility of harml…
Kejia Chen, Jiawen Zhang, Boheng Li, Pengcheng Li +5 more
The paper proposes mitigating the progressive degradation of safety in language models caused by many-shot jailbreak attacks by appending a single, fixed safety demonstration at inference time.
The paper demonstrates that advanced jailbreaks do not impose a significant 'jailbreak tax' on highly capable frontier language models, retaining near-native performance.
The paper introduces a novel survival analysis framework to quantify how LLM safety degrades over repeated adversarial attacks, revealing distinct vulnerability profiles among tested models.
SAFEDREAM introduces a lightweight, external world-model framework that proactively detects multi-turn jailbreak attacks by modeling cumulative safety erosion and predicting early failure points.
Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more
The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by shifting the model's internal…
Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more
The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by guiding the model's internal r…
The paper introduces Head-Masked Nullspace Steering (HMNS), a novel geometry-aware attack method that achieves state-of-the-art jailbreak success rates by manipulating the internal attention mechanism…