Papers similar to 2606.05609v1

~ similar to 2606.05609v1· 19 results

cs.CRRecentMay 4, 2026

Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

The paper introduces a new adaptive jailbreak attack (JB-GCG) that successfully bypasses the state-of-the-art JBShield defense, and proposes a more robust defense (RTV) based on multi-layer representa…

View →

cs.CRcs.AIRecentMay 6, 2026

SoK: Robustness in Large Language Models against Jailbreak Attacks

Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more

This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…

View →

cs.CRcs.AIcs.LGRecentMay 9, 2026

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more

This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…

View →

cs.CRcs.AIRecentMar 28, 2026

GUARD-SLM: Token Activation-Based Defense Against Jailbreak Attacks for Small Language Models

Md Jueal Mia, Joaquin Molto, Yanzhao Wu, M. Hadi Amini

The paper proposes GUARD-SLM, a token activation-based defense mechanism, to enhance the robustness of Small Language Models (SLMs) against various jailbreak attacks by analyzing and filtering malicio…

View →

cs.CRcs.AIRecentMay 10, 2026

MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

Xinkai Zhang, Zhipeng Wei, Huanli Gong, Jing Ting Zheng +3 more

The paper introduces MT-JailBench, a modular framework for evaluating multi-turn jailbreaks, demonstrating that controlling experimental components like prompt generation and resource budgets is cruci…

View →

cs.CRcs.AIRecentMay 11, 2026

Re-Triggering Safeguards within LLMs for Jailbreak Detection

Zheng Lin, Zhenxing Niu, Haoxuan Ji, Yuzhe Huang +1 more

The paper introduces an embedding disruption method to re-activate and strengthen built-in safeguards within LLMs, effectively detecting and defending against sophisticated jailbreak attacks.

View →

cs.CRcs.AIcs.CLRecentApr 13, 2026

The Salami Slicing Threat: Exploiting Cumulative Risks in LLM Systems

Yihao Zhang, Kai Wang, Jiangrong Wu, Haolin Wu +6 more

The paper introduces Salami Slicing Risk, a novel multi-turn jailbreak technique that accumulates harmful intent through numerous low-risk inputs, achieving state-of-the-art attack success rates again…

View →

cs.CRcs.AIRecentMay 14, 2026

The Great Pretender: A Stochasticity Problem in LLM Jailbreak

Jean-Philippe Monteuuis, Cong Chen, Jonathan Petit

The paper argues that the standard Attack Success Rate (ASR) metric for LLM jailbreaks is unstable and systematically inflated, proposing new frameworks to account for stochasticity in both evaluation…

View →

cs.CRcs.SERecentMay 15, 2026

Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs

Reinelle Jan Bugnot, Soohyeon Choi, Hoon Wei Lim, Yue Duan

This paper systematically analyzes the interaction of multiple weak jailbreak attacks (mutators) applied sequentially to LLMs, finding that most combinations fail due to destructive interference, reve…

View →

cs.CRcs.AIRecentMay 19, 2026

Exploring and Developing a Pre-Model Safeguard with Draft Models

Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi +1 more

The paper proposes a novel pre-model safeguard that uses small draft models (SLMs) to predict the safety of prompts, significantly reducing false-negative rates while maintaining low computational ove…

View →

cs.AIcs.CRRecentJun 4, 2026

GuardNet: Ensemble Strategies of Shallow Neural Networks for Robust Prompt Injection and Jailbreak Detection

Paulo Ricardo Ferreira Neves, Edson Rodrigues da Cruz Filho, Paulo Henrique Eleuterio Falsetti, João Vitor Pavan +6 more

GuardNet is a lightweight, ensemble-based guardrail system using shallow neural networks that provides robust and efficient detection of Prompt Injection and Jailbreak attacks on LLMs, suitable for pr…

View →

cs.SDcs.AIcs.CLRecentMay 28, 2026

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu, You-Hsuan Chang +1 more

This paper provides a unified taxonomy and controlled empirical evaluation of jailbreak attacks and defenses for Large Audio Language Models (LALMs), demonstrating that safety evaluation must consider…

View →

cs.CRRecentMay 20, 2026

Adversarial Reframing: A Framework for Targeted Generation in Language Models

Shahnewaz Karim Sakib, Swati Kar, Anindya Bijoy Das

The paper introduces THREAT, a novel reasoning-driven framework that efficiently discovers highly effective and targeted jailbreak prompts for LLMs, revealing previously unknown safety vulnerabilities…

View →

cs.CRcs.AIRecentJun 1, 2026

MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models

Yingzi Ma, Zhengyue Zhao, Xiaogeng Liu, Minhui Xue +2 more

MaskForge is a novel, adaptive, black-box attack framework that significantly improves jailbreaking diffusion large language models (dLLMs) by treating red-teaming as an optimized search over reusable…

View →

cs.CRcs.LGRecentMay 23, 2026

Steering Beyond the Support: Adversarial Training on Unsupervised Jailbroken Activation Simulation

Luoyu Chen, Weiqi Wang, Zhiyi Tian, Chenhan Zhang +4 more

The paper proposes an unsupervised bi-level adversarial training framework to enhance LLM safety steering, achieving strong zero-shot defense against unseen and evolving jailbreak prompts.

View →

cs.CRRecentMay 28, 2026

Evolving Skill-Structured Attack Memory Enhances LLM Jailbreaking

Junke Zhang, Jianwei Wang, Sishuo Chen, Yizhang He +2 more

The paper proposes MemoAttack, a memory-driven black-box jailbreak framework that systematically models, evolves, and selects attack experiences to significantly enhance LLM jailbreaking success rates…

View →

cs.CRcs.CLRecentMay 1, 2026

SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking

Jindong Li, Ying Liu, Yali Fu, Jinjing Zhu +3 more

The paper proposes SRTJ, a Self-Evolving Rule-Driven Training-Free Jailbreak framework that systematically discovers and refines attack strategies using rule composition and feedback to achieve robust…

View →

cs.CRcs.AIcs.CLRecentMay 6, 2026

Sparse Tokens Suffice: Jailbreaking Audio Language Models via Token-Aware Gradient Optimization

Zheng Fang, Xiaosen Wang, Shenyi Zhang, Shaokang Wang +1 more

The paper introduces Token-Aware Gradient Optimization (TAGO), demonstrating that sparse optimization focusing only on high-gradient audio tokens is sufficient for effective jailbreaking of audio lang…

View →

cs.CRcs.AIRecentMay 16, 2026

New Wide-Net-Casting Jailbreak Attacks Risk Large Models

Qiuchi Xiang, Haoxuan Qu, Hossein Rahmani, Jun Liu

This paper introduces the 'wide-net-casting' jailbreak scenario, demonstrating that querying a group of large language models can expose significant, previously overlooked safety risks, with a novel m…

View →