ArXivCSExplorer
☆☆Bookmarks🏆RSSHow to UseFAQ
Built with and by Teycir Ben Soltane•
How to Use•FAQ•GitHub•arXiv.org•
Share:

~ similar to 2604.27861v1· 19 results

cs.CRRecentMay 4, 2026

Revisiting JBShield: Breaking and Rebuilding Representation-Level Jailbreak Defenses

Kemal Derya, Berk Sunar

The paper introduces a new adaptive jailbreak attack (JB-GCG) that successfully bypasses the state-of-the-art JBShield defense, and proposes a more robust defense (RTV) based on multi-layer representa…

View →
cs.CLcs.CRRecentApr 1, 2026

One Word at a Time: Incremental Completion Decomposition Breaks LLM Safety

Samee Arif, Naihao Deng, Zhijing Jin, Rada Mihalcea

The paper introduces Incremental Completion Decomposition (ICD), a novel jailbreak strategy that successfully bypasses LLM safety mechanisms by eliciting malicious content through a sequence of single…

View →
cs.CRcs.AIRecentMay 6, 2026

SoK: Robustness in Large Language Models against Jailbreak Attacks

Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more

This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…

View →
cs.CRcs.AIcs.LGRecentMay 9, 2026

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more

This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…

View →
cs.CRcs.AIRecentJun 1, 2026

MaskForge: Structure-Aware Adaptive Attacks for Jailbreaking Diffusion Large Language Models

Yingzi Ma, Zhengyue Zhao, Xiaogeng Liu, Minhui Xue +2 more

MaskForge is a novel, adaptive, black-box attack framework that significantly improves jailbreaking diffusion large language models (dLLMs) by treating red-teaming as an optimized search over reusable…

View →
cs.CRcs.AIRecentMar 28, 2026

GUARD-SLM: Token Activation-Based Defense Against Jailbreak Attacks for Small Language Models

Md Jueal Mia, Joaquin Molto, Yanzhao Wu, M. Hadi Amini

The paper proposes GUARD-SLM, a token activation-based defense mechanism, to enhance the robustness of Small Language Models (SLMs) against various jailbreak attacks by analyzing and filtering malicio…

View →
cs.CRcs.AIRecentMay 10, 2026

MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

Xinkai Zhang, Zhipeng Wei, Huanli Gong, Jing Ting Zheng +3 more

The paper introduces MT-JailBench, a modular framework for evaluating multi-turn jailbreaks, demonstrating that controlling experimental components like prompt generation and resource budgets is cruci…

View →
cs.CRRecentMay 23, 2026

Ellipsoid Control: A White-list Jailbreak Defense via Benign Latent Modeling

Luoyu Chen, Weiqi Wang, Zhiyi Tian, Feng Wu +2 more

The paper proposes Ellipsoid Control, a white-list defense mechanism that uses benign data geometry to constrain model updates, thereby enhancing jailbreak safety while preserving the utility of harml…

View →
cs.CRcs.AIcs.LGRecentJun 4, 2026

SlotGCG: Exploiting the Positional Vulnerability in LLMs for Jailbreak Attacks

Seungwon Jeong, Jiwoo Jeong, Hyeonjin Kim, Yunseok Lee +1 more

The paper introduces SlotGCG, an improved jailbreak attack method that systematically searches for the most vulnerable token insertion positions (slots) within a prompt, significantly boosting attack…

View →
cs.CRcs.AIRecentMay 18, 2026

Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling

Ziwei Wang, Jing Chen, Ruichao Liang, Zhi Wang +5 more

The paper introduces Babel, an efficient black-box attack framework that systematically exploits intrinsic safety gaps in LLMs by optimizing text obfuscation sampling, achieving state-of-the-art jailb…

View →
cs.CRcs.SERecentMay 15, 2026

Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs

Reinelle Jan Bugnot, Soohyeon Choi, Hoon Wei Lim, Yue Duan

This paper systematically analyzes the interaction of multiple weak jailbreak attacks (mutators) applied sequentially to LLMs, finding that most combinations fail due to destructive interference, reve…

View →
cs.CLcs.AIRecentJun 1, 2026

THRD: A Training-Free Multi-Turn Defense Framework for Jailbreak Attacks on Large Language Models

Zhiqing Ma, Zhonghao Xu, Dong Yu, Chen Kang +2 more

THRD introduces a novel, training-free framework that models temporal risk accumulation to effectively defend against multi-turn jailbreak attacks on LLMs, significantly reducing attack success rates…

View →
cs.SDcs.AIcs.CLRecentMay 28, 2026

Audio Jailbreaks in Large Audio-Language Models: Taxonomy, Attack-Defense Analysis, and Cost-Aware Evaluation

Bo-Han Feng, Yu-Hsuan Li Liang, Chien-Feng Liu, You-Hsuan Chang +1 more

This paper provides a unified taxonomy and controlled empirical evaluation of jailbreak attacks and defenses for Large Audio Language Models (LALMs), demonstrating that safety evaluation must consider…

View →
cs.CRcs.CLRecentMay 1, 2026

SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking

Jindong Li, Ying Liu, Yali Fu, Jinjing Zhu +3 more

The paper proposes SRTJ, a Self-Evolving Rule-Driven Training-Free Jailbreak framework that systematically discovers and refines attack strategies using rule composition and feedback to achieve robust…

View →
cs.CRcs.AIRecentApr 11, 2026

Jailbreaking the Matrix: Nullspace Steering for Controlled Model Subversion

Vishal Pramanik, Maisha Maliha, Susmit Jha, Sumit Kumar Jha

The paper introduces Head-Masked Nullspace Steering (HMNS), a novel geometry-aware attack method that achieves state-of-the-art jailbreak success rates by manipulating the internal attention mechanism…

View →
cs.CRcs.AIRecentMay 8, 2026

Mitigating Many-shot Jailbreak Attacks with One Single Demonstration

Kejia Chen, Jiawen Zhang, Boheng Li, Pengcheng Li +5 more

The paper proposes mitigating the progressive degradation of safety in language models caused by many-shot jailbreak attacks by appending a single, fixed safety demonstration at inference time.

View →
cs.CRRecentJun 1, 2026

Benign Inputs, Harmful Outputs: Cross-Modal Jailbreaking via Distributed Semantic Recomposition

Yani Wang, Yilong Yang, Yang Liu, Zhuzhu Wang +2 more

The paper introduces Distributed Semantic Recomposition (DSR), a novel cross-modal jailbreaking framework that bypasses existing safety filters by decomposing harmful intent into benign input componen…

View →
cs.CRcs.AIcs.IRRecentJun 2, 2026

Patcher: Post-Hoc Patching of Backdoored Large Language Models

Anjun Gao, Yueyang Quan, Yufei Xia, Zhuqing Liu +1 more

Patcher is a post-hoc defense framework that repairs backdoored large language models by localizing hidden triggers and patching the model using only a single reported failure case.

View →
cs.CRRecentMay 20, 2026

Adversarial Reframing: A Framework for Targeted Generation in Language Models

Shahnewaz Karim Sakib, Swati Kar, Anindya Bijoy Das

The paper introduces THREAT, a novel reasoning-driven framework that efficiently discovers highly effective and targeted jailbreak prompts for LLMs, revealing previously unknown safety vulnerabilities…

View →