~ similar to 2603.29038v1· 19 results
Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more
This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…
This paper provides a systematic, lifecycle-based framework for analyzing security threats and defenses across the entire fine-tuning process of LLMs, revealing that attack effectiveness is highly mod…
This paper demonstrates that existing open-weight LLM safeguards are vulnerable to simple, non-gradient-based attacks like abliteration and prefilling, significantly increasing the attack success rate…
Sangyeon Yoon, Wonje Jeung, Yoonjun Cho, Dongjae Jeon +1 more
The paper introduces a truly benign Direct Preference Optimization (DPO) attack that can jailbreak large language models (LLMs) by fine-tuning them with minimal, harmless preference data, thereby supp…
This paper systematically analyzes the interaction of multiple weak jailbreak attacks (mutators) applied sequentially to LLMs, finding that most combinations fail due to destructive interference, reve…
The paper introduces a new adaptive jailbreak attack (JB-GCG) that successfully bypasses the state-of-the-art JBShield defense, and proposes a more robust defense (RTV) based on multi-layer representa…
Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more
This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…
Paulo Ricardo Ferreira Neves, Edson Rodrigues da Cruz Filho, Paulo Henrique Eleuterio Falsetti, João Vitor Pavan +6 more
GuardNet is a lightweight, ensemble-based guardrail system using shallow neural networks that provides robust and efficient detection of Prompt Injection and Jailbreak attacks on LLMs, suitable for pr…
The paper introduces THREAT, a novel reasoning-driven framework that efficiently discovers highly effective and targeted jailbreak prompts for LLMs, revealing previously unknown safety vulnerabilities…
Lixing Lin, Juli You, Yue Li, Luyun Lin +3 more
Reflect-Guard enhances LLM safety classifiers by integrating logical self-reflection, significantly improving detection of sophisticated adversarial jailbreak prompts.
This paper provides a unified taxonomy and controlled empirical evaluation of jailbreak attacks and defenses for Large Audio Language Models (LALMs), demonstrating that safety evaluation must consider…
Anjun Gao, Yueyang Quan, Yufei Xia, Zhuqing Liu +1 more
Patcher is a post-hoc defense framework that repairs backdoored large language models by localizing hidden triggers and patching the model using only a single reported failure case.
The paper demonstrates that current defenses against malicious fine-tuning of foundation models are insufficient because they only address fixed attacks, and introduces a unified adaptive attack that…
The paper proposes a novel safety fine-tuning method that uses the target model's own rollouts to identify and train on the hardest prompts, significantly reducing jailbreak success rates while mainta…
The paper establishes a standardized security assessment framework and develops a multi-layered defensive system, demonstrating that systematic testing and external defenses are crucial for safe LLM d…
Jindong Li, Ying Liu, Yali Fu, Jinjing Zhu +3 more
The paper proposes SRTJ, a Self-Evolving Rule-Driven Training-Free Jailbreak framework that systematically discovers and refines attack strategies using rule composition and feedback to achieve robust…
The paper systematically maps LLM agent vulnerabilities by testing 10,000 prompt variations, finding that 'goal reframing' language is the primary trigger for exploitation, rather than broad adversari…
Bowen Sun, Chaozhuo Li, Yaodong Yang, Yiwei Wang +1 more
TwinGate introduces a stateful dual-encoder defense framework using Asymmetric Contrastive Learning to detect malicious intent from fragmented, untraceable LLM queries with high recall and low false p…
Luoyu Chen, Weiqi Wang, Zhiyi Tian, Feng Wu +2 more
The paper proposes Ellipsoid Control, a white-list defense mechanism that uses benign data geometry to constrain model updates, thereby enhancing jailbreak safety while preserving the utility of harml…