Papers similar to 2605.17128v1

Similar to 2605.17128v1 · 19 results

cs.CR · cs.AI · Recent · May 10, 2026

MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

Xinkai Zhang, Zhipeng Wei, Huanli Gong, Jing Ting Zheng +3 more

The paper introduces MT-JailBench, a modular framework for evaluating multi-turn jailbreaks, demonstrating that controlling experimental components like prompt generation and resource budgets is cruci…

cs.CR · cs.AI · cs.LG · Recent · May 26, 2026

Jailbreak susceptibility prediction and mitigation via the behavioral geometry of models

Hayden Helm, Xiaodong Liu, Weiwei Yang

The paper introduces a framework using the 'behavioral geometry' of model populations to efficiently predict jailbreak susceptibility and transfer defenses, achieving high accuracy with significantly…

cs.CR · cs.AI · Recent · May 6, 2026

SoK: Robustness in Large Language Models against Jailbreak Attacks

Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more

This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…

cs.CR · cs.AI · cs.CL · Recent · Apr 13, 2026

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Yihao Zhang, Kai Wang, Jiangrong Wu, Haolin Wu +6 more

The paper introduces Salami Slicing Risk, a novel multi-turn jailbreak technique that accumulates harmful intent through numerous low-risk inputs, achieving state-of-the-art attack success rates again…

cs.CR · cs.AI · Recent · May 9, 2026

Single-Configuration Attack Success Rate Is Not Enough: Jailbreak Evaluations Should Report Distributional Attack Success

Carsten Maple, Abhishek Kumar, Riya Tapwal

This paper argues that reporting only the best-case attack success rate for jailbreaks is insufficient, proposing new distributional metrics (VSM and UC) to better characterize the true threat posed b…

cs.CR · cs.AI · Recent · May 19, 2026

Exploring and Developing a Pre-Model Safeguard with Draft Models

Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi +1 more

The paper proposes a novel pre-model safeguard that uses small draft models (SLMs) to predict the safety of prompts, significantly reducing false-negative rates while maintaining low computational ove…

cs.CR · cs.AI · cs.LG · Recent · May 9, 2026

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more

This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…

cs.CR · cs.AI · Recent · May 8, 2026

Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

Kejia Chen, Jiawen Zhang, Boheng Li, Pengcheng Li +5 more

The paper proposes mitigating the progressive degradation of safety in language models caused by many-shot jailbreak attacks by appending a single, fixed safety demonstration at inference time.

cs.CR · cs.SE · Recent · May 15, 2026

Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs

Reinelle Jan Bugnot, Soohyeon Choi, Hoon Wei Lim, Yue Duan

This paper systematically analyzes the interaction of multiple weak jailbreak attacks (mutators) applied sequentially to LLMs, finding that most combinations fail due to destructive interference, reve…

cs.CR · Recent · May 4, 2026

Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

Kemal Derya, Berk Sunar

The paper introduces a new adaptive jailbreak attack (JB-GCG) that successfully bypasses the state-of-the-art JBShield defense, and proposes a more robust defense (RTV) based on multi-layer representa…

cs.CR · cs.AI · Recent · Apr 18, 2026

SafeDream: Safety World Model for Proactive Early Jailbreak Detection

Bo Yan, Weikai Lin, Yada Zhu, Song Wang

SAFEDREAM introduces a lightweight, external world-model framework that proactively detects multi-turn jailbreak attacks by modeling cumulative safety erosion and predicting early failure points.

cs.CV · cs.AI · cs.CL · Recent · May 27, 2026

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more

The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by shifting the model's internal…

cs.CR · cs.LG · Recent · May 23, 2026

Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation

Luoyu Chen, Weiqi Wang, Zhiyi Tian, Chenhan Zhang +4 more

The paper proposes an unsupervised bi-level adversarial training framework to enhance LLM safety steering, achieving strong zero-shot defense against unseen and evolving jailbreak prompts.

cs.CR · cs.AI · Recent · May 14, 2026

The Great Pretender: A Stochasticity Problem in LLM Jailbreak

Jean-Philippe Monteuuis, Cong Chen, Jonathan Petit

The paper argues that the standard Attack Success Rate (ASR) metric for LLM jailbreaks is unstable and systematically inflated, proposing new frameworks to account for stochasticity in both evaluation…

cs.LG · cs.AI · cs.CR · Recent · Apr 30, 2026

Jailbroken Frontier Models Retain Their Capabilities

Daniel Zhu, Zihan Wang, Xuchan Bao, Jerry Wei

The paper demonstrates that advanced jailbreaks do not impose a significant 'jailbreak tax' on highly capable frontier language models, retaining near-native performance.

cs.CR · cs.AI · Recent · May 13, 2026

Quantifying LLM Safety Degradation Under Repeated Attacks Using Survival Analysis

Zvi Topol

The paper introduces a novel survival analysis framework to quantify how LLM safety degrades over repeated adversarial attacks, revealing distinct vulnerability profiles among tested models.

cs.SD · cs.AI · cs.CL · Recent · May 28, 2026

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu, You-Hsuan Chang +1 more

This paper provides a unified taxonomy and controlled empirical evaluation of jailbreak attacks and defenses for Large Audio Language Models (LALMs), demonstrating that safety evaluation must consider…

cs.CR · cs.AI · Recent · May 11, 2026

Re-Triggering Safeguards within LLMs for Jailbreak Detection

Zheng Lin, Zhenxing Niu, Haoxuan Ji, Yuzhe Huang +1 more

The paper introduces an embedding disruption method to re-activate and strengthen built-in safeguards within LLMs, effectively detecting and defending against sophisticated jailbreak attacks.
