~ similar to 2604.24542v1· 18 results
Krishiv Agarwal, Ramneet Kaur, Colin Samplawski, Manoj Acharya +5 more
The paper conducts an interpretability-driven safety audit of eight state-of-the-art LLMs, demonstrating that while interpretability-based steering is a powerful auditing tool, model robustness varies…
Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more
This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…
Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more
This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…
Weiyang Guo, Zesheng Shi, Zeen Zhu, Yuan Zhou +2 more
This paper introduces a novel backdoor attack (ACB) against Reinforcement Learning with Verifiable Rewards (RLVR), demonstrating that poisoning the training data can implant a backdoor that significan…
The paper introduces THREAT, a novel reasoning-driven framework that efficiently discovers highly effective and targeted jailbreak prompts for LLMs, revealing previously unknown safety vulnerabilities…
Hongyu Cai, Arjun Arunasalam, Yiming Liang, Antonio Bianchi +1 more
The paper proposes a novel pre-model safeguard that uses small draft models (SLMs) to predict the safety of prompts, significantly reducing false-negative rates while maintaining low computational ove…
The paper introduces FLIPS, an instance-level fingerprinting technique that exploits biases in generated random sequences to accurately distinguish between different configurations of the same Large L…
This paper provides a systematic, lifecycle-based framework for analyzing security threats and defenses across the entire fine-tuning process of LLMs, revealing that attack effectiveness is highly mod…
This paper provides a unified taxonomy and controlled empirical evaluation of jailbreak attacks and defenses for Large Audio Language Models (LALMs), demonstrating that safety evaluation must consider…
Yihao Zhang, Kai Wang, Jiangrong Wu, Haolin Wu +6 more
The paper introduces Salami Slicing Risk, a novel multi-turn jailbreak technique that accumulates harmful intent through numerous low-risk inputs, achieving state-of-the-art attack success rates again…
The paper introduces a robust evaluation methodology, Gate AI, to accurately benchmark LLM security detectors by eliminating systematic weaknesses like per-dataset threshold tuning and undisclosed ope…
The paper introduces Indirect Harm Optimization (IHO), a novel black-box, adaptive, and efficient attack method that significantly improves jailbreak success rates against LLMs, aiming to provide a st…
Lixing Lin, Juli You, Yue Li, Luyun Lin +3 more
Reflect-Guard enhances LLM safety classifiers by integrating logical self-reflection, significantly improving detection of sophisticated adversarial jailbreak prompts.
Jindong Li, Ying Liu, Yali Fu, Jinjing Zhu +3 more
The paper proposes SRTJ, a Self-Evolving Rule-Driven Training-Free Jailbreak framework that systematically discovers and refines attack strategies using rule composition and feedback to achieve robust…
SAFEDREAM introduces a lightweight, external world-model framework that proactively detects multi-turn jailbreak attacks by modeling cumulative safety erosion and predicting early failure points.
The paper introduces a new adaptive jailbreak attack (JB-GCG) that successfully bypasses the state-of-the-art JBShield defense, and proposes a more robust defense (RTV) based on multi-layer representa…
This paper systematically analyzes the interaction of multiple weak jailbreak attacks (mutators) applied sequentially to LLMs, finding that most combinations fail due to destructive interference, reve…
Rui Yin, Tianxu Han, Naen Xu, Changjiang Li +7 more
The paper proposes a novel method to inject reliable, sustained backdoors into LLMs by compiling an activation steering vector into model weights, ensuring the backdoor only activates upon a specific…