~ similar to 2604.18756v1· 19 results
Feiyue Xu, Hongsheng Hu, Chaoxiang He, Sheng Hang +8 more
This paper introduces Security Cube, a comprehensive, multi-dimensional framework for evaluating LLM robustness against jailbreak attacks, providing a systematic taxonomy and benchmark analysis of exi…
Zheng Fang, Xiaosen Wang, Shenyi Zhang, Shaokang Wang +1 more
The paper introduces Token-Aware Gradient Optimization (TAGO), demonstrating that sparse optimization focusing only on high-gradient audio tokens is sufficient for effective jailbreaking of audio lang…
The paper proposes GUARD-SLM, a token activation-based defense mechanism, to enhance the robustness of Small Language Models (SLMs) against various jailbreak attacks by analyzing and filtering malicio…
The paper introduces Head-Masked Nullspace Steering (HMNS), a novel geometry-aware attack method that achieves state-of-the-art jailbreak success rates by manipulating the internal attention mechanism…
The paper introduces a new adaptive jailbreak attack (JB-GCG) that successfully bypasses the state-of-the-art JBShield defense, and proposes a more robust defense (RTV) based on multi-layer representa…
Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more
The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by shifting the model's internal…
Yuan Tian, Bing Hu, Fang Wu, Xiaomin Li +2 more
The paper investigates multimodal jailbreak robustness across various reasoning paradigms and finds that explicit image-tool interaction significantly improves safety by guiding the model's internal r…
Luoyu Chen, Weiqi Wang, Zhiyi Tian, Feng Wu +2 more
The paper proposes Ellipsoid Control, a white-list defense mechanism that uses benign data geometry to constrain model updates, thereby enhancing jailbreak safety while preserving the utility of harml…
This paper provides a unified taxonomy and controlled empirical evaluation of jailbreak attacks and defenses for Large Audio Language Models (LALMs), demonstrating that safety evaluation must consider…
The paper theorizes that aligned LLMs remain jailbreakable due to 'Refusal-Escape Directions' (RED), which are continuous perturbation paths that shift model behavior from refusal to answering, and sh…
Paulo Ricardo Ferreira Neves, Edson Rodrigues da Cruz Filho, Paulo Henrique Eleuterio Falsetti, João Vitor Pavan +6 more
GuardNet is a lightweight, ensemble-based guardrail system using shallow neural networks that provides robust and efficient detection of Prompt Injection and Jailbreak attacks on LLMs, suitable for pr…
Krishiv Agarwal, Ramneet Kaur, Colin Samplawski, Manoj Acharya +5 more
The paper conducts an interpretability-driven safety audit of eight state-of-the-art LLMs, demonstrating that while interpretability-based steering is a powerful auditing tool, model robustness varies…
This paper theoretically analyzes Continuous Adversarial Training (CAT) for LLMs using In-context Learning (ICL) theory, proving that embedding space perturbations effectively enhance robustness again…
Ismail Hossain, Tanzim Ahad, Md Jahangir Alam, Sai Puppala +2 more
This paper addresses the lack of systematic infrastructure for evaluating jailbreak attacks by introducing a large-scale dataset, an automated generation method, and a continuous evaluation metric tha…
The paper theoretically analyzes the properties that optimal sparse autoencoder (SAE) dictionaries must satisfy, deriving constraints that explain observed SAE behaviors like hierarchical splitting an…
The paper introduces Residualized Sparse Autoencoders (ReSAEs) to improve multi-layer interventions in transformers by training each layer on the residual activation, which better preserves cross-laye…
The paper introduces Incremental Completion Decomposition (ICD), a novel jailbreak strategy that successfully bypasses LLM safety mechanisms by eliciting malicious content through a sequence of single…
Yingzi Ma, Zhengyue Zhao, Xiaogeng Liu, Minhui Xue +2 more
MaskForge is a novel, adaptive, black-box attack framework that significantly improves jailbreaking diffusion large language models (dLLMs) by treating red-teaming as an optimized search over reusable…
Luoyu Chen, Weiqi Wang, Zhiyi Tian, Chenhan Zhang +4 more
The paper proposes an unsupervised bi-level adversarial training framework to enhance LLM safety steering, achieving strong zero-shot defense against unseen and evolving jailbreak prompts.