~ similar to 2606.01322· 18 results
Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi +1 more
The paper proposes a novel pre-model safeguard that uses small draft models (SLMs) to predict the safety of prompts, significantly reducing false-negative rates while maintaining low computational ove…
Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more
This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…
This paper introduces ComicJailbreak, a new benchmark demonstrating that structured visual narratives can effectively jailbreak Multimodal Large Language Models (MLLMs), requiring new safety alignment…
Ziwei Wang, Jing Chen, Ruichao Liang, Zhi Wang +5 more
The paper introduces Babel, an efficient black-box attack framework that systematically exploits intrinsic safety gaps in LLMs by optimizing text obfuscation sampling, achieving state-of-the-art jailb…
Xinkai Zhang, Zhipeng Wei, Huanli Gong, Jing Ting Zheng +3 more
The paper introduces MT-JailBench, a modular framework for evaluating multi-turn jailbreaks, demonstrating that controlling experimental components like prompt generation and resource budgets is cruci…
The paper introduces THREAT, a novel reasoning-driven framework that efficiently discovers highly effective and targeted jailbreak prompts for LLMs, revealing previously unknown safety vulnerabilities…
The paper introduces a robust evaluation methodology, Gate AI, to accurately benchmark LLM security detectors by eliminating systematic weaknesses like per-dataset threshold tuning and undisclosed ope…
The paper proposes GUARD-SLM, a token activation-based defense mechanism, to enhance the robustness of Small Language Models (SLMs) against various jailbreak attacks by analyzing and filtering malicio…
The paper theorizes that aligned LLMs remain jailbreakable due to 'Refusal-Escape Directions' (RED), which are continuous perturbation paths that shift model behavior from refusal to answering, and sh…
Krishiv Agarwal, Ramneet Kaur, Colin Samplawski, Manoj Acharya +5 more
The paper conducts an interpretability-driven safety audit of eight state-of-the-art LLMs, demonstrating that while interpretability-based steering is a powerful auditing tool, model robustness varies…
The paper introduces Multi-Clip Video (MCV) SafetyBench, a dataset demonstrating that the vulnerability of Multimodal Large Language Models (MLLMs) to jailbreaking increases with the diversity and num…
The paper introduces Incremental Completion Decomposition (ICD), a novel jailbreak strategy that successfully bypasses LLM safety mechanisms by eliciting malicious content through a sequence of single…
Wenyu Chen, Xiangtao Meng, Chuanchao Zang, Li Wang +5 more
The paper proposes TriageFuzz, a token-aware fuzzing framework that significantly reduces the number of queries needed to jailbreak LLMs while maintaining high attack success rates.
Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more
This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…
This paper systematically analyzes the interaction of multiple weak jailbreak attacks (mutators) applied sequentially to LLMs, finding that most combinations fail due to destructive interference, reve…
Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more
The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by shifting the model's internal…
Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more
The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by guiding the model's internal r…
This paper provides a unified taxonomy and controlled empirical evaluation of jailbreak attacks and defenses for Large Audio Language Models (LALMs), demonstrating that safety evaluation must consider…