ArXivCSExplorer
☆☆Bookmarks🏆RSSHow to UseFAQ
Built with and by Teycir Ben Soltane•
How to Use•FAQ•GitHub•arXiv.org•
Share:

~ similar to 2605.07414v1· 20 results

cs.CRcs.AIRecentMay 10, 2026

MT-JailBench: A Modular Benchmark for Understanding Multi-Turn Jailbreak Attacks

Xinkai Zhang, Zhipeng Wei, Huanli Gong, Jing Ting Zheng +3 more

The paper introduces MT-JailBench, a modular framework for evaluating multi-turn jailbreaks, demonstrating that controlling experimental components like prompt generation and resource budgets is cruci…

View →
cs.CVcs.AIcs.CLRecentMay 27, 2026

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more

The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by shifting the model's internal…

View →
cs.CVcs.AIcs.CLRecentMay 27, 2026

When Think-with-Image Meets Safety: What Determines Multimodal Jailbreak Robustness?

Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more

The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by guiding the model's internal r…

View →
cs.CRcs.SERecentMay 15, 2026

Compositional Jailbreaking: An Empirical Analysis of Mutator Chain Interactions in Aligned LLMs

Reinelle Jan Bugnot, Soohyeon Choi, Hoon Wei Lim, Yue Duan

This paper systematically analyzes the interaction of multiple weak jailbreak attacks (mutators) applied sequentially to LLMs, finding that most combinations fail due to destructive interference, reve…

View →
cs.CRcs.CLRecentMay 1, 2026

SRTJ: Self-Evolving Rule-Driven Training-Free LLM Jailbreaking

Jindong Li, Ying Liu, Yali Fu, Jinjing Zhu +3 more

The paper proposes SRTJ, a Self-Evolving Rule-Driven Training-Free Jailbreak framework that systematically discovers and refines attack strategies using rule composition and feedback to achieve robust…

View →
cs.CRcs.AIcs.LGRecentMar 24, 2026

Not All Tokens Are Created Equal: Query-Efficient Jailbreak Fuzzing for LLMs

Wenyu Chen, Xiangtao Meng, Chuanchao Zang, Li Wang +5 more

The paper proposes TriageFuzz, a token-aware fuzzing framework that significantly reduces the number of queries needed to jailbreak LLMs while maintaining high attack success rates.

View →
cs.CRcs.AIRecentMar 20, 2026

Evolving Jailbreaks: Automated Multi-Objective Long-Tail Attacks on Large Language Models

Wenjing Hong, Zhonghua Rong, Li Wang, Feng Chang +4 more

The paper introduces EvoJail, an automated multi-objective evolutionary framework that systematically discovers diverse and effective long-tail jailbreak attacks against LLMs by optimizing for attack…

View →
cs.CRcs.AIcs.LGRecentMay 9, 2026

The Art of the Jailbreak: Formulating Jailbreak Attacks for LLM Security Beyond Binary Scoring

Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more

This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…

View →
cs.CRcs.AIRecentMay 6, 2026

SoK: Robustness in Large Language Models against Jailbreak Attacks

Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more

This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…

View →
cs.CRcs.AIRecentMay 18, 2026

Babel: Jailbreaking Safety Attention via Obfuscation Distribution Optimized Sampling

Ziwei Wang, Jing Chen, Ruichao Liang, Zhi Wang +5 more

The paper introduces Babel, an efficient black-box attack framework that systematically exploits intrinsic safety gaps in LLMs by optimizing text obfuscation sampling, achieving state-of-the-art jailb…

View →
cs.CRcs.AIcs.SERecentApr 14, 2026

TEMPLATEFUZZ: Fine-Grained Chat Template Fuzzing for Jailbreaking and Red Teaming LLMs

Qingchao Shen, Zibo Xiao, Lili Huang, Enwei Hu +2 more

TEMPLATEFUZZ is a fine-grained fuzzing framework that systematically tests chat templates to find vulnerabilities in LLMs, achieving high jailbreak success rates with minimal performance degradation.

View →
cs.CRcs.AIRecentMay 16, 2026

New Wide-Net-Casting Jailbreak Attacks Risk Large Models

Qiuchi Xiang, Haoxuan Qu, Hossein Rahmani, Jun Liu

This paper introduces the 'wide-net-casting' jailbreak scenario, demonstrating that querying a group of large language models can expose significant, previously overlooked safety risks, with a novel m…

View →
cs.CRcs.AIRecentMay 31, 2026

D-Judge: Disrupting Multi-Turn Jailbreaks using Semantics-Preserving Output Rewriting

Huanli Gong, Zhipeng Wei, Yu Fu, Haz Sameen Shahgir +3 more

D-Judge introduces a semantics-preserving output rewriting defense that disrupts multi-turn jailbreak attacks by misaligning the feedback signal used by an attacker's judge model.

View →
cs.CRcs.AIRecentMay 19, 2026

Exploring and Developing a Pre-Model Safeguard with Draft Models

Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi +1 more

The paper proposes a novel pre-model safeguard that uses small draft models (SLMs) to predict the safety of prompts, significantly reducing false-negative rates while maintaining low computational ove…

View →
cs.CRRecentJun 1, 2026

Benign Inputs, Harmful Outputs: Cross-Modal Jailbreaking via Distributed Semantic Recomposition

Yani Wang, Yilong Yang, Yang Liu, Zhuzhu Wang +2 more

The paper introduces Distributed Semantic Recomposition (DSR), a novel cross-modal jailbreaking framework that bypasses existing safety filters by decomposing harmful intent into benign input componen…

View →
cs.CRRecentMay 29, 2026

TRACE: Task-Aware Adaptive Self-Evolving Agentic Jailbreaking

Churui Zeng, Weiwei Qi, Kedong Xiu, Tianhang Zheng +4 more

The paper proposes TRACE, a novel agentic jailbreaking framework that successfully bypasses safety mechanisms of advanced LLM agents by decomposing malicious tasks and disguising harmful subtasks with…

View →
cs.CRcs.AIcs.CLRecentApr 22, 2026

Breaking MCP with Function Hijacking Attacks: Novel Threats for Function Calling and Agentic Models

Yannis Belkhiter, Giulio Zizzo, Sergio Maffeis, Seshu Tirupathi +1 more

This paper introduces a novel Function Hijacking Attack (FHA) that manipulates the tool selection process of agentic models, demonstrating a robust and context-agnostic threat to function calling LLMs…

View →
cs.CRcs.AIcs.MMRecentMar 23, 2026

Structured Visual Narratives Undermine Safety Alignment in Multimodal Large Language Models

Rui Yang Tan, Yujia Hu, Roy Ka-Wei Lee

This paper introduces ComicJailbreak, a new benchmark demonstrating that structured visual narratives can effectively jailbreak Multimodal Large Language Models (MLLMs), requiring new safety alignment…

View →
cs.CRcs.LGRecentApr 22, 2026

Breaking Bad: Interpretability-Based Safety Audits of State-of-the-Art LLMs

Krishiv Agarwal, Ramneet Kaur, Colin Samplawski, Manoj Acharya +5 more

The paper conducts an interpretability-driven safety audit of eight state-of-the-art LLMs, demonstrating that while interpretability-based steering is a powerful auditing tool, model robustness varies…

View →
cs.CRcs.AIRecentMay 11, 2026

Re-Triggering Safeguards within LLMs for Jailbreak Detection

Zheng Lin, Zhenxing Niu, Haoxuan Ji, Yuzhe Huang +1 more

The paper introduces an embedding disruption method to re-activate and strengthen built-in safeguards within LLMs, effectively detecting and defending against sophisticated jailbreak attacks.

View →